On Elasticsearch, how to aggregate based on the number of items in a field?

On Elasticsearch I have a field named Itinerary that can contain multiple values (from 1 up to 6). For example, the document below has 2 items in the field:
"Itinerary": [
{
"Carrier": "LH",
"Departure": "2021-07-04T06:55:00Z",
"Number": "1493",
"Arrival": "2021-07-04T08:40:00Z"
},
{
"Carrier": "LH",
"Departure": "2021-07-04T13:30:00Z",
"Number": "422",
"Arrival": "2021-07-04T16:05:00Z"
}
]
Is there a way to aggregate based on the number of items in the field, to get something like:
1 item : 2
2 items : 4
...

The Itinerary field needs to be defined as a nested type:
"Itinerary":
{
"type": "nested"
}
Use a terms aggregation to group on a field. You can use a script to get the length of the array or, better, index a separate field that stores the array's count.
Then use a top_hits aggregation to get the documents under each group:
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"script": {
"source": "doc['Itinerary.Carrier.keyword'].length"
}
},
"aggs": {
"NAME": {
"top_hits": {
"size": 10
}
}
}
}
}
}
Result:
"aggregations" : {
"NAME" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2,
"doc_count" : 2,
"NAME" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index8",
"_type" : "_doc",
"_id" : "8OW1lnsBRh1xpgSkIOlq",
"_score" : 1.0,
"_source" : {
"Itinerary" : [
{
"Carrier" : "LH",
"Departure" : "2021-07-04T06:55:00Z",
"Number" : "1493",
"Arrival" : "2021-07-04T08:40:00Z"
},
{
"Carrier" : "LH",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "422",
"Arrival" : "2021-07-04T16:05:00Z"
}
]
}
},
{
"_index" : "index8",
"_type" : "_doc",
"_id" : "8uW6lnsBRh1xpgSkAun1",
"_score" : 1.0,
"_source" : {
"Itinerary" : [
{
"Carrier" : "LH2",
"Departure" : "2021-07-04T06:55:00Z",
"Number" : "14931",
"Arrival" : "2021-07-04T08:40:00Z"
},
{
"Carrier" : "LH2",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "4221",
"Arrival" : "2021-07-04T16:05:00Z"
}
]
}
}
]
}
}
},
{
"key" : 3,
"doc_count" : 1,
"NAME" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index8",
"_type" : "_doc",
"_id" : "8eW1lnsBRh1xpgSkdukQ",
"_score" : 1.0,
"_source" : {
"Itinerary" : [
{
"Carrier" : "LH1",
"Departure" : "2021-07-04T06:55:00Z",
"Number" : "14931",
"Arrival" : "2021-07-04T08:40:00Z"
},
{
"Carrier" : "LH1",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "4221",
"Arrival" : "2021-07-04T16:05:00Z"
},
{
"Carrier" : "LH1",
"Departure" : "2021-07-04T13:30:00Z",
"Number" : "3221",
"Arrival" : "2021-07-04T16:05:00Z"
}
]
}
}
]
}
}
}
]
}
}
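The "introduce a field which has count of array" option mentioned above could look roughly like this in Python. This is only a sketch, not the answerer's code: the field name ItineraryCount, the aggregation name by_item_count and both helper functions are hypothetical, and the buckets are emulated client-side instead of being fetched from a cluster. It is worth checking the scripted variant on your own data, because keyword doc values are stored as a de-duplicated set per document, so a script such as doc['Itinerary.Carrier.keyword'].length may report the number of distinct carriers rather than the number of itinerary items.

```python
from collections import Counter


def with_itinerary_count(doc):
    """Return a copy of doc with a hypothetical ItineraryCount field added."""
    enriched = dict(doc)
    enriched["ItineraryCount"] = len(doc.get("Itinerary", []))
    return enriched


def terms_agg_body(field="ItineraryCount"):
    """Build a plain terms aggregation on the precomputed count field."""
    return {"size": 0, "aggs": {"by_item_count": {"terms": {"field": field}}}}


# Trimmed-down versions of the three documents shown in the result above.
docs = [
    {"Itinerary": [{"Carrier": "LH"}, {"Carrier": "LH"}]},
    {"Itinerary": [{"Carrier": "LH2"}, {"Carrier": "LH2"}]},
    {"Itinerary": [{"Carrier": "LH1"}, {"Carrier": "LH1"}, {"Carrier": "LH1"}]},
]

# Enrich each document before indexing it.
enriched_docs = [with_itinerary_count(d) for d in docs]

# Client-side emulation of the buckets the terms aggregation would return.
buckets = Counter(d["ItineraryCount"] for d in enriched_docs)
print(dict(buckets))
```

With the count stored at index time, the aggregation needs no script and no nested mapping just for counting.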

Related

Aggregating multiple values of single key into a single bucket elasticsearch

I have an Elasticsearch index with the following mapping:
{
"probe_alert" : {
"mappings" : {
"alert" : {
"properties" : {
"id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"probeChannelId" : {
"type" : "long"
},
"severity" : {
"type" : "integer"
}
}
}
}
}
}
Sample indexed data (each channel has a severity value):
[
{
"_index" : "probe_alert",
"_type" : "alert",
"_id" : "b_cu0nYB8EMvknGcmMxk",
"_score" : 0.0,
"_source" : {
"id" : "b_cu0nYB8EMvknGcmMxk",
"probeChannelId" : 15,
"severity" : 2,
}
},
{
"_index" : "probe_alert",
"_type" : "alert",
"_id" : "b_cu0nYB8EMvknGcmMxk",
"_score" : 0.0,
"_source" : {
"id" : "b_cu0nYB8EMvknGcmMxk",
"probeChannelId" : 17,
"severity" : 5,
}
},
{
"_index" : "probe_alert",
"_type" : "alert",
"_id" : "b_cu0nYB8EMvknGcmMxk",
"_score" : 0.0,
"_source" : {
"id" : "b_cu0nYB8EMvknGcmMxk",
"probeChannelId" : 18,
"severity" : 10,
}
},
{
"_index" : "probe_alert",
"_type" : "alert",
"_id" : "b_cu0nYB8EMvknGcmMxk",
"_score" : 0.0,
"_source" : {
"id" : "b_cu0nYB8EMvknGcmMxk",
"probeChannelId" : 19,
"severity" : 5,
}
},
{
"_index" : "probe_alert",
"_type" : "alert",
"_id" : "b_cu0nYB8EMvknGcmMxk",
"_score" : 0.0,
"_source" : {
"id" : "b_cu0nYB8EMvknGcmMxk",
"probeChannelId" :20,
"severity" : 10,
}
}
]
I have used a terms aggregation to fetch the max severity value for a single probeChannelId, but now I want to aggregate over groups of probeChannelId values and get the max severity for each group.
Expected result:
"aggregations" : {
"aggs_by_channels" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : [15,17],
"doc_count" : 1,
"aggs_by_severity" : {
"value" : 5.0
}
},
{
"key" : [18,19,20],
"doc_count" : 1,
"aggs_by_severity" : {
"value" : 10.0
}
}
]
}
}
In the response, I want each group of probeChannelId values to carry its highest severity value.
If you want to get the highest severity value among a set of documents, you can try the query below, which uses the adjacency matrix aggregation.
Search Query:
{
"size": 0,
"aggs": {
"interactions": {
"adjacency_matrix": {
"filters": {
"[15,17]": {
"terms": {
"probeChannelId": [
15,
17
]
}
},
"[18,19,20]": {
"terms": {
"probeChannelId": [
18,
19,
20
]
}
}
}
},
"aggs": {
"max_severity": {
"max": {
"field": "severity"
}
}
}
}
}
}
Search Result:
"aggregations": {
"interactions": {
"buckets": [
{
"key": "[15,17]",
"doc_count": 2,
"max_severity": {
"value": 5.0 // note this
}
},
{
"key": "[18,19,20]",
"doc_count": 3,
"max_severity": {
"value": 10.0 // note this
}
}
]
}
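If the groups of channel IDs are not fixed, the request body above can be generated. The Python sketch below assumes the same field names as the query (probeChannelId, severity); the function name grouped_max_severity is made up for illustration. Note that the adjacency matrix aggregation also returns intersection buckets (keys joined with "&") when filters overlap; with disjoint groups like these, a filters aggregation would produce the same two buckets.

```python
def grouped_max_severity(groups, field="probeChannelId", metric_field="severity"):
    """Build an adjacency_matrix request with one named filter per group of IDs."""
    filters = {
        # The filter name doubles as the bucket key in the response, e.g. "[15,17]".
        "[" + ",".join(str(i) for i in ids) + "]": {"terms": {field: list(ids)}}
        for ids in groups
    }
    return {
        "size": 0,
        "aggs": {
            "interactions": {
                "adjacency_matrix": {"filters": filters},
                "aggs": {"max_severity": {"max": {"field": metric_field}}},
            }
        },
    }


body = grouped_max_severity([[15, 17], [18, 19, 20]])
print(sorted(body["aggs"]["interactions"]["adjacency_matrix"]["filters"]))
```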

ElasticSearch - Filter Buckets

My Elasticsearch query looks like this:
{
"size": 0,
"aggs": {
"group_by_id": {
"terms": {
"field": "Infos.InstanceInfo.ID.keyword",
"size": 1000
},
"aggs": {
"tops": {
"top_hits": {
"size": 100,
"sort": {
"Infos.InstanceInfo.StartTime": "asc"
}
}
}
}
}
}
}
It works fine; I get a result of this form:
aggregations
=========>group_by_id
==============>buckets
{key:id1}
===============>docs
{doc1.Status:"KO"}
{doc2.Status:"KO"}
{key:id2}
===============>docs
{doc1.Status:"KO"}
{doc2.Status:"OK"}
{key:id3}
===============>docs
{doc1.Status:"KO"}
{doc2.Status:"OK"}
I'm trying to add a filter so that, when filtering on "OK", the result looks like this:
aggregations
=========>group_by_id
==============>buckets
{key:id2}
===============>docs
{doc1.Status:"KO"}
{doc2.Status:"OK"}
{key:id3}
===============>docs
{doc1.Status:"KO"}
{doc2.Status:"OK"}
and for "KO" :
aggregations
=========>group_by_id
==============>buckets
{key:id1}
===============>docs
{doc1.Status:"KO"}
{doc2.Status:"KO"}
The fields "StartTime" and "Status" are at the same level, under "Infos.InstanceInfo.[...]".
Any idea?
EDIT
Sample docs:
{
"took" : 794,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 143846,
"buckets" : [
{
"key" : "1000",
"doc_count" : 6,
"tops" : {
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "vHFvoXYBVWrYChNi7hB7",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "1000",
"StartTime" : "2020-12-27T00:43:56.011+01:00",
"status" : "KO"
}
}
},
"sort" : [
1609026236011
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "xHFvoXYBVWrYChNi7xAB",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "1000",
"StartTime" : "2020-12-27T00:43:56.145+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609026236145
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "xXFvoXYBVWrYChNi7xAC",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "1000",
"StartTime" : "2020-12-27T00:43:56.147+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609026236147
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "x3FvoXYBVWrYChNi7xAs",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "1000",
"StartTime" : "2020-12-27T00:43:56.188+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609026236188
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "yHFvoXYBVWrYChNi7xAs",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "1000",
"StartTime" : "2020-12-27T00:43:56.19+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609026236190
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "ynFvoXYBVWrYChNi7xBd",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "1000",
"StartTime" : "2020-12-27T00:43:56.236+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609026236236
]
}
]
}
}
},
{
"key" : "2000",
"doc_count" : 2,
"tops" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "7HL_onYBVWrYChNij4Is",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "2000",
"StartTime" : "2020-12-27T08:00:26.011+01:00",
"status" : "KO"
}
}
},
"sort" : [
1609052426011
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "9HL_onYBVWrYChNij4Kz",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "2000",
"StartTime" : "2020-12-27T08:00:26.146+01:00",
"status" : "KO"
}
}
},
"sort" : [
1609052426146
]
}
]
}
}
},
{
"key" : "3000",
"doc_count" : 6,
"tops" : {
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "7nNRpHYBVWrYChNiiruh",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "3000",
"StartTime" : "2020-12-27T14:09:36.015+01:00",
"status" : "KO"
}
}
},
"sort" : [
1609074576015
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "9nNRpHYBVWrYChNii7s5",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "3000",
"StartTime" : "2020-12-27T14:09:36.166+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609074576166
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "93NRpHYBVWrYChNii7s5",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "3000",
"StartTime" : "2020-12-27T14:09:36.166+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609074576166
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "-XNRpHYBVWrYChNii7ti",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "3000",
"StartTime" : "2020-12-27T14:09:36.209+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609074576209
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "-nNRpHYBVWrYChNii7ts",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "3000",
"StartTime" : "2020-12-27T14:09:36.219+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609074576219
]
},
{
"_index" : "azerty",
"_type" : "_doc",
"_id" : "_HNRpHYBVWrYChNii7ud",
"_score" : null,
"_source" : {
"Infos" : {
"InstanceInfo" : {
"ID" : "3000",
"StartTime" : "2020-12-27T14:09:36.269+01:00",
"status" : "OK"
}
}
},
"sort" : [
1609074576269
]
}
]
}
}
}
]
}
}
}
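Assuming the status field is under Infos.InstanceInfo and is mapped as a keyword, you can use the filter aggregation. Field names are case-sensitive: the sample documents above show the field as lowercase "status", so adjust the path in the term query if that is how it is indexed.
{
"size": 0,
"aggs": {
"status_KO_only": {
"filter": {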
"term": {
"Infos.InstanceInfo.Status": "KO"
}
},
"aggs": {
"group_by_id": {
"terms": {
"field": "Infos.InstanceInfo.ID.keyword",
"size": 1000
},
"aggs": {
"tops": {
"top_hits": {
"size": 100,
"sort": {
"Infos.InstanceInfo.StartTime": "asc"
}
}
}
}
}
}
}
}
}
In this particular case you could've applied the same term query in the query part of the search request without having to use a filter aggregation.
If you want to get both OK and KO in the same request, you can copy/paste the whole status_KO_only aggregation, rename the 2nd one, and voila -- you now have both groups in one request. You can of course have as many differently named (top-level) filter aggs as you like.
Now, when you do need multiple filter aggs at once, there's a more elegant way that does not require copy-pasting: the filters aggregation.
{
"size": 0,
"aggs": {
"by_statuses": {
"filters": {
"filters": {
"status_KO": {
"term": {
"Infos.InstanceInfo.Status": "KO"
}
},
"status_OK": {
"term": {
"Infos.InstanceInfo.Status": "OK"
}
}
}
},
"aggs": {
"group_by_id": {
"terms": {
"field": "Infos.InstanceInfo.ID.keyword",
"size": 1000
},
"aggs": {
"tops": {
"top_hits": {
"size": 100,
"sort": {
"Infos.InstanceInfo.StartTime": "asc"
}
}
}
}
}
}
}
}
}
Each child sub-aggregation is then computed automatically inside the bucket of every explicitly declared term filter.
I personally find the copy/paste approach more readable, esp. when constructing such requests dynamically (based on UI dropdowns and such.)
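For the dynamic case (statuses coming from UI dropdowns and such), the filters request can be assembled programmatically. The following is a rough Python sketch, assuming the same field paths as the queries above; the function name by_status_request and the status_ prefix for bucket names are illustrative choices, not anything Elasticsearch requires.

```python
def by_status_request(statuses, status_field="Infos.InstanceInfo.Status"):
    """Build a filters aggregation with one named bucket per status value."""
    per_status = {f"status_{s}": {"term": {status_field: s}} for s in statuses}
    # Same sub-aggregation as in the hand-written queries above.
    group_by_id = {
        "terms": {"field": "Infos.InstanceInfo.ID.keyword", "size": 1000},
        "aggs": {
            "tops": {
                "top_hits": {
                    "size": 100,
                    "sort": {"Infos.InstanceInfo.StartTime": "asc"},
                }
            }
        },
    }
    return {
        "size": 0,
        "aggs": {
            "by_statuses": {
                "filters": {"filters": per_status},
                "aggs": {"group_by_id": group_by_id},
            }
        },
    }


body = by_status_request(["KO", "OK"])
print(list(body["aggs"]["by_statuses"]["filters"]["filters"]))
```

Passing a different status_field (for example the lowercase path) only changes the term filters; the sub-aggregation stays the same.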

How can I extend an Elasticsearch date range histogram aggregation query?

Hi, I have an Elasticsearch index named mep-report.
Each document has a status field. The possible values for the status field are "ENROUTE", "SUBMITTED", "DELIVERED" and "FAILED". Below is a sample of the index with 6 documents.
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 13,
"successful" : 13,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1094313,
"max_score" : 1.0,
"hits" : [
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837500",
"_score" : 1.0,
"_source" : {
"status" : "ENROUTE",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837501",
"_score" : 1.0,
"_source" : {
"status" : "ENROUTE",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837502",
"_score" : 1.0,
"_source" : {
"status" : "SUBMITTED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837503",
"_score" : 1.0,
"_source" : {
"status" : "DELIVERED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837504",
"_score" : 1.0,
"_source" : {
"status" : "FAILED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837504",
"_score" : 1.0,
"_source" : {
"status" : "FAILED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
}
]
}
}
I would like a histogram aggregation that reports message_processed, message_delivered and message_failed per bucket:
message_processed : 3 (2 documents with status ENROUTE + 1 document with status SUBMITTED)
message_delivered : 1 (1 document with status DELIVERED)
message_failed : 2 (2 documents with status FAILED)
Expected result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 13,
"successful" : 13,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 21300,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"performance_over_time" : {
"buckets" : [
{
"key_as_string" : "2020-02-21",
"key" : 1582243200000,
"doc_count" : 6,
"message_processed": 3,
"message_delivered": 1,
"message_failed": 2
}
]
}
}
}
The following is my current query. I would like to modify it to get the additional statistics message_processed, message_delivered and message_failed. Kindly let me know how.
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"from": "2020-02-21T00:00Z",
"to": "2020-02-21T23:59:59.999Z",
"include_lower": true,
"include_upper": true,
"format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ ||yyyy-MM-dd'T'HH:mmZ",
"boost": 1.0
}
}
}
],
"adjust_pure_negative": true,
"boost": 1.0
}
},
"aggregations": {
"performance_over_time": {
"date_histogram": {
"field": "#timestamp",
"format": "yyyy-MM-dd",
"interval": "1d",
"offset": 0,
"order": {
"_key": "asc"
},
"keyed": false,
"min_doc_count": 0
}
}
}
}
You are almost there with the query; you just need to add a terms aggregation. Looking at your request, I've come up with a scripted terms aggregation.
I've also changed the date histogram's interval parameter to calendar_interval so that the buckets follow calendar days.
Query Request:
POST <your_index_name>/_search
{
"size": 0,
"query":{
"bool":{
"must":[
{
"range":{
"#timestamp":{
"from":"2019-09-10",
"to":"2019-09-12",
"include_lower":true,
"include_upper":true,
"boost":1.0
}
}
}
],
"adjust_pure_negative":true,
"boost":1.0
}
},
"aggs":{
"message_processed":{
"date_histogram": {
"field": "#timestamp",
"calendar_interval": "1d"
},
"aggs": {
"my_messages": {
"terms": {
"script": {
"source": """
if(doc['status'].value=="ENROUTE" || doc['status'].value == "SUBMITTED"){
return "message_processed";
}else if(doc['status'].value=="DELIVERED"){
return "message_delivered";
}else {
return "message_failed";
}
""",
"lang": "painless"
},
"size": 10
}
}
}
}
}
}
Note that the core logic you are looking for is inside the scripted terms aggregation. The logic is self-explanatory if you read through it; feel free to adapt it to your needs.
For the sample data you've shared, you would get the result in the format below:
Response:
{
"took" : 144,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"message_processed" : {
"buckets" : [
{
"key_as_string" : "2019-09-11T00:00:00.000Z",
"key" : 1568160000000,
"doc_count" : 6,
"my_messages" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "message_processed",
"doc_count" : 3
},
{
"key" : "message_failed",
"doc_count" : 2
},
{
"key" : "message_delivered",
"doc_count" : 1
}
]
}
}
]
}
}
}
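To sanity-check the mapping before running it on a cluster, the branching of the Painless script can be mirrored in a few lines of Python. This is a client-side illustration only; the function name message_bucket is made up, and the status list is the six sample documents from the question.

```python
from collections import Counter


def message_bucket(status):
    """Mirror the if/else chain of the Painless script above."""
    if status in ("ENROUTE", "SUBMITTED"):
        return "message_processed"
    if status == "DELIVERED":
        return "message_delivered"
    # Like the script's final else branch, any other status counts as failed.
    return "message_failed"


# Status values of the six sample documents.
statuses = ["ENROUTE", "ENROUTE", "SUBMITTED", "DELIVERED", "FAILED", "FAILED"]
counts = Counter(message_bucket(s) for s in statuses)
print(counts.most_common())
# [('message_processed', 3), ('message_failed', 2), ('message_delivered', 1)]
```

One thing the sketch makes visible: the final else branch puts every unrecognised status into message_failed, so an explicit check for "FAILED" may be preferable if other status values can appear.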

GET TOP HIT FROM A VALUE IF THIS IS 0 KIBANA

This is my first post; I spent the weekend looking for an answer without success.
I will try to explain my issue. I have this index:
ST ID
0 1
1 1
0 2
1 2
0 2
1 3
0 3
I need to show the last record for each ID when its ST is 0. For example, in this index I have to show only ID 1 and ID 2, because the last record has ST set to 0 for IDs 1 and 2.
Could someone help me with this issue?
BR
Mapping:
PUT index34
{
"mappings": {
"properties": {
"ST":{
"type": "integer"
},
"ID":{
"type": "integer"
},
"Date":{
"type": "date"
}
}
}
}
Data:
[
{
"_index" : "index34",
"_type" : "_doc",
"_id" : "LO7Z7W0B_-hMjUaqtwHw",
"_score" : 1.0,
"_source" : {
"ST" : 1,
"ID" : 1,
"Date" : "2019-10-21T12:00:00Z"
}
},
{
"_index" : "index34",
"_type" : "_doc",
"_id" : "Le7Z7W0B_-hMjUaq0QEz",
"_score" : 1.0,
"_source" : {
"ST" : 0,
"ID" : 1,
"Date" : "2019-10-21T12:01:00Z"
}
},
{
"_index" : "index34",
"_type" : "_doc",
"_id" : "Lu7a7W0B_-hMjUaqAwE0",
"_score" : 1.0,
"_source" : {
"ST" : 1,
"ID" : 2,
"Date" : "2019-10-21T12:02:00Z"
}
},
{
"_index" : "index34",
"_type" : "_doc",
"_id" : "L-7a7W0B_-hMjUaqGAEr",
"_score" : 1.0,
"_source" : {
"ST" : 0,
"ID" : 2,
"Date" : "2019-10-21T12:04:00Z"
}
},
{
"_index" : "index34",
"_type" : "_doc",
"_id" : "MO7a7W0B_-hMjUaqNAGA",
"_score" : 1.0,
"_source" : {
"ST" : 0,
"ID" : 3,
"Date" : "2019-10-21T12:04:00Z"
}
},
{
"_index" : "index34",
"_type" : "_doc",
"_id" : "Me7a7W0B_-hMjUaqTQFP",
"_score" : 1.0,
"_source" : {
"ST" : 1,
"ID" : 3,
"Date" : "2019-10-21T12:06:00Z"
}
}
]
Query: I get the max date for every ID term, and then the max date among the documents where ST is zero. If the two match (which means the latest document has ST 0), I keep that bucket.
GET index34/_search
{
"size": 0,
"aggs": {
"ID": {
"terms": {
"field": "ID",
"size": 10000
},
"aggs": {
"maxDate": {
"max": {
"field": "Date"
}
},
"pending_status": {
"filter": {
"term": {
"ST": 0
}
},
"aggs": {
"filtered_maxdate": {
"max": {
"field": "Date"
}
}
}
},
"buckets_latest_status_pending": {
"bucket_selector": {
"buckets_path": {
"filtereddate": "pending_status>filtered_maxdate",
"maxDate": "maxDate"
},
"script": "params.filtereddate==params.maxDate"
}
}
}
}
}
}
Response:
"aggregations" : {
"ID" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 2,
"pending_status" : {
"doc_count" : 1,
"filtered_maxdate" : {
"value" : 1.57165926E12,
"value_as_string" : "2019-10-21T12:01:00.000Z"
}
},
"maxDate" : {
"value" : 1.57165926E12,
"value_as_string" : "2019-10-21T12:01:00.000Z"
}
},
{
"key" : 2,
"doc_count" : 2,
"pending_status" : {
"doc_count" : 1,
"filtered_maxdate" : {
"value" : 1.57165944E12,
"value_as_string" : "2019-10-21T12:04:00.000Z"
}
},
"maxDate" : {
"value" : 1.57165944E12,
"value_as_string" : "2019-10-21T12:04:00.000Z"
}
}
]
}
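The bucket_selector logic can be checked offline with a small Python emulation. This is a sketch over the six sample documents from the Data section, not a replacement for the query; the function name ids_with_latest_st_zero is hypothetical.

```python
docs = [
    {"ST": 1, "ID": 1, "Date": "2019-10-21T12:00:00Z"},
    {"ST": 0, "ID": 1, "Date": "2019-10-21T12:01:00Z"},
    {"ST": 1, "ID": 2, "Date": "2019-10-21T12:02:00Z"},
    {"ST": 0, "ID": 2, "Date": "2019-10-21T12:04:00Z"},
    {"ST": 0, "ID": 3, "Date": "2019-10-21T12:04:00Z"},
    {"ST": 1, "ID": 3, "Date": "2019-10-21T12:06:00Z"},
]


def ids_with_latest_st_zero(docs):
    """Keep IDs whose overall max date equals the max date among ST == 0 docs."""
    max_date = {}       # plays the role of the maxDate aggregation
    max_zero_date = {}  # plays the role of pending_status>filtered_maxdate
    for d in docs:
        key = d["ID"]
        # UTC ISO-8601 strings in one format compare chronologically as text.
        max_date[key] = max(max_date.get(key, ""), d["Date"])
        if d["ST"] == 0:
            max_zero_date[key] = max(max_zero_date.get(key, ""), d["Date"])
    # Equivalent of the bucket_selector script: filtereddate == maxDate.
    return sorted(k for k in max_date if max_zero_date.get(k) == max_date[k])


print(ids_with_latest_st_zero(docs))
# [1, 2]
```

ID 3 is dropped because its latest document (12:06) has ST 1, which matches the two buckets in the response above.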

Fetching unique data in Elasticsearch

I have the following data:
ID: 1, fldname: pawan
ID: 1, fldname: pawan1
ID: 1, fldname: pawan2
ID: 2, fldname: pawan3
ID: 3, fldname: pawan4
ID: 4, fldname: pawan5
I am trying to get unique data based on the ID field, similar to what we get in MySQL when running group by queries like:
select * from table_name where fldname like 'pawan%' group by ID
This returns unique values. The same works in Sphinx search when we use the group by function.
Is there any way to get unique values in Elasticsearch?
Below is my sample mapping:
"mappings": {
"my_type": {
"properties": {
"docid": {
"type": "keyword"
},
"flgname": {
"type": "text"
}
}
}
}
I suggest that you slightly modify your mapping:
{
"record" : {
"dynamic" : "false",
"_all" : {
"enabled" : false
},
"properties" : {
"docid" : {
"type" : "long"
},
"flgname" : {
"type" : "text"
}
}
}
}
so that docid is a long.
Then you could try fuzzy queries for filtering, together with aggregations, like the one below, which retrieves the minimum, maximum, average and distinct count (cardinality) of the docid values:
{
"from" : 0,
"size" : 10,
"_source" : true,
"query" : {
"bool" : {
"must" : [ {
"match" : {
"flgname" : {
"query" : "pawan",
"operator" : "OR",
"fuzziness" : "1",
"prefix_length" : 1,
"max_expansions" : 50,
"fuzzy_transpositions" : true,
"lenient" : false,
"zero_terms_query" : "NONE",
"boost" : 1.0
}
}
} ]
}
},
"aggs" : {
"my_cardinality" : {
"cardinality" : {
"field" : "docid"
}
},
"my_avg" : {
"avg" : {
"field" : "docid"
}
},
"my_min" : {
"min" : {
"field" : "docid"
}
},
"my_max" : {
"max" : {
"field" : "docid"
}
}
}
}
By the way, this is the result of the above query on the data you provided:
{
"took" : 47,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 0.9808292,
"hits" : [ {
"_index" : "stack_overflow1",
"_type" : "record",
"_id" : "40b5eac0-743b-4a6a-a06d-3ae4d56f4aca",
"_score" : 0.9808292,
"_source" : {
"docid" : "1",
"flgname" : "pawan"
}
}, {
"_index" : "stack_overflow1",
"_type" : "record",
"_id" : "27821c39-e722-4361-bc07-0dcd5181a1ad",
"_score" : 0.7846634,
"_source" : {
"docid" : "2",
"flgname" : "pawan3"
}
}, {
"_index" : "stack_overflow1",
"_type" : "record",
"_id" : "86fcd9c1-a688-4a6a-9c45-e91791a8b902",
"_score" : 0.7846634,
"_source" : {
"docid" : "4",
"flgname" : "pawan5"
}
}, {
"_index" : "stack_overflow1",
"_type" : "record",
"_id" : "fb00a3cc-f1b8-4073-8808-f2ddbc4979e2",
"_score" : 0.55451775,
"_source" : {
"docid" : "1",
"flgname" : "pawan1"
}
}, {
"_index" : "stack_overflow1",
"_type" : "record",
"_id" : "18e5e20d-17a7-4d59-b2f1-7bf325a4c4df",
"_score" : 0.55451775,
"_source" : {
"docid" : "3",
"flgname" : "pawan4"
}
}, {
"_index" : "stack_overflow1",
"_type" : "record",
"_id" : "fbf49af6-f574-4ad2-8686-cbbedc5e70c4",
"_score" : 0.23014566,
"_source" : {
"docid" : "1",
"flgname" : "pawan2"
}
} ]
},
"aggregations" : {
"my_cardinality" : {
"value" : 4
},
"my_max" : {
"value" : 4.0
},
"my_avg" : {
"value" : 2.0
},
"my_min" : {
"value" : 1.0
}
}
}
If you also make flgname a keyword, you can use a sub-aggregation: aggregate over docid and sub-aggregate over flgname. The result will be similar to the SQL query you mentioned.
The query would look like:
{ "size": 0,
"query": {
"regexp":{
"flgname": "pawa.*"
}
},
"aggs" : {
"docids": {
"terms": {"field": "docid"},
"aggs": { "flgnam": { "terms": {"field": "flgname"}}}}
}
}
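To see what shape of result the nested terms aggregation produces, here is a small Python emulation over the six sample rows from the question. It is only an illustration: the function name group_by_docid is made up, and startswith stands in for the regexp query "pawa.*" on a keyword field.

```python
from collections import defaultdict

# (docid, flgname) pairs from the question's sample data.
rows = [
    (1, "pawan"),
    (1, "pawan1"),
    (1, "pawan2"),
    (2, "pawan3"),
    (3, "pawan4"),
    (4, "pawan5"),
]


def group_by_docid(rows, prefix="pawa"):
    """Emulate terms(docid) > terms(flgname) after a prefix-style filter."""
    groups = defaultdict(list)
    for docid, flgname in rows:
        if flgname.startswith(prefix):
            groups[docid].append(flgname)
    return dict(groups)


result = group_by_docid(rows)
print(sorted(result))
# [1, 2, 3, 4]
```

Each outer key corresponds to one docids bucket, and the list under it corresponds to the flgnam sub-buckets.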

Resources