How to get last and first document ids by given criteria - elasticsearch

I have some documents indexed on Elasticsearch, looking like these samples:
{"id":"1","isMigrated":true}
{"id":"2","isMigrated":true}
{"id":"3","isMigrated":false}
{"id":"4","isMigrated":false}
how can i get in one query the last migrated id and first not migrated id?
Any ideas?

Filter aggregation and top_hits aggregation can be used to get last migrated and first not migrated
{
"size": 0,
"aggs": {
"migrated": {
"filter": { --> filter where isMigrated:true
"term": {
"isMigrated": true
}
},
"aggs": {
"last_migrated": { --> get first documents sorted on id in descending order
"top_hits": {
"size": 1,
"sort": [{"id.keyword":"desc"}]
}
}
}
},
"not_migrated": {
"filter": {
"term": {
"isMigrated": false
}
},
"aggs": {
"first_not_migrated": {
"top_hits": {
"size": 1,
"sort": [{"id.keyword":"asc"}] -->any keyword field can be used to sort
}
}
}
}
}
}
Result:
"aggregations" : {
"not_migrated" : {
"doc_count" : 2,
"first_not_migrated" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "index86",
"_type" : "_doc",
"_id" : "TxuKUHIB8mx5yKbJ_rGH",
"_score" : null,
"_source" : {
"id" : "3",
"isMigrated" : false
},
"sort" : [
"3"
]
}
]
}
}
},
"migrated" : {
"doc_count" : 2,
"last_migrated" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "index86",
"_type" : "_doc",
"_id" : "ThuKUHIB8mx5yKbJ87HF",
"_score" : null,
"_source" : {
"id" : "2",
"isMigrated" : true
},
"sort" : [
"2"
]
}
]
}
}
}
}

You can store the timestamp information with each document and query based on the latest timestamp and isMigrated: true condition.
As per comment, https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html can you used to combine multiple boolean conditions.

Related

is there a way of showing documents after a sum aggregation?

I've been trying lately to retrieve information about sales on Kibana DSL.
I've been told to show vendors information PLUS their monthly sales.
(I'll use the "Kibana_sample_data_ecommerce" for this example)
I already did this aggregation in order to group all clients by their 'customer_id':
#Aggregations (group by)
GET kibana_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"by user_id": {
"terms": {
"field": "customer_id"
},
"aggs": {
"add_field_to_bucket": {
"top_hits": {"size": 1, "_source": {"includes": ["customer_full_name"]}}
}
}
}
}
}
in which i've included customer_full_name in the result:
"aggregations" : {
"by user_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 2970,
"buckets" : [
{
"key" : "27",
"doc_count" : 348,
"add_field_to_bucket" : {
"hits" : {
"total" : 348,
"max_score" : 1.0,
"hits" : [
{
"_index" : "kibana_sample_data_ecommerce",
"_type" : "_doc",
"_id" : "fhwUR3sBpfDKGuVlpu8r",
"_score" : 1.0,
"_source" : {
"customer_full_name" : "Elyssa Underwood"
}
}
]
}
}
}
So, in this result i know that 'Elyssa Underwood' with 'customerid' '27' has 348 hits (or documents related).
Also i recquire to know the total spent by 'Elyssa' on those products, using the field 'products.taxful_price'.
The thing is that i cannot perform a subaggregation on top_hits (as far as i know); Also I've tried to do a sum_aggregation, but it ends on the same result (i got my sum, but i cannot access top_hits sub aggregation at that point).
At the end of the day i want to have a result like this:
"hits" : [
{
"_index" : "kibana_sample_data_ecommerce",
"_type" : "_doc",
"_id" : "fhwUR3sBpfDKGuVlpu8r",
"_score" : 1.0,
"_source" : {
"customer_full_name" : "Elyssa Underwood",
"total_spent": 1234.5678
}
}
]
Is there something I can do to achieve it?.
PS: I'm using ElasticSearch 5.x and also I have access to NEST client, if there's a solution I can reach through it.
Thanks In Advance.
I have used below as sample data.
Data:
{
"customer_id":2,
"client-name":"b",
"purchase": 2001
}
Query:
GET index/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "customer_id",
"size": 10
},
"aggs": {
"total_sales": {
"sum": {
"field": "purchase"
}
},
"documents":{
"top_hits": {
"size": 10
}
}
}
}
}
}
Result:
{
"key" : 2,
"doc_count" : 1,
"documents" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "0HPzcHsBjw4ziwrzGzrq",
"_score" : 1.0,
"_source" : {
"customer_id" : 2,
"client-name" : "b",
"purchase" : 2001
}
}
]
}
},
"total_sales" : {
"value" : 2001.0
}
}

Elasticsearch DSL query

{
"name" : "Danny",
"id" : "123",
"lastProfileUpdateTime" : "2021-06-26T20:08:25.089Z"
},
{
"name" : "Harry",
"id" : "124",
"lastProfileUpdateTime" : "2021-04-12T20:08:25.089Z"
},
{
"name" : "Danny Brown",
"id" : "123",
"lastProfileUpdateTime" : "2021-07-26T20:08:25.089Z"
},
{
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
}
I have a usecase where i need to find if a particular id has been updated or not and filter out latest profile. In above case since id:123 has updated profile. so expected outcome should be:
{
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
}
If a id has more than one entry, pick the one which has latest lastProfileUpdateTime.
This can be done using
Terms aggregation
: to group by "id"
Bucket selector: to get ids where count > 1
Top_hits: To get latest document for given id
Query
{
"size": 0,
"aggs": {
"Updated_docs": {
"terms": {
"field": "id.keyword",
"size": 10
},
"aggs": {
"filter_count_more_than_one": {
"bucket_selector": {
"buckets_path": {
"count": "_count"
},
"script": "params.count>1"
}
},
"latest_document": {
"top_hits": {
"size": 1,
"sort": [
{
"lastProfileUpdateTime": "desc"
}
]
}
}
}
}
}
}
Result
"aggregations" : {
"Updated_docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "123",
"doc_count" : 3,
"latest_document" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "index4",
"_type" : "_doc",
"_id" : "huE1i3sBZ5_aIkj8cvpR",
"_score" : null,
"_source" : {
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
},
"sort" : [
1630008505089
]
}
]
}
}
}
]
}
}

global sorting across different buckets after aggregation in elasticsearch

a sample in my document is as shown below.
{"rackName" : "rack005", "roomName" : "roomB", "power" : 132, "timestamp" : 1594540106208}
the thing I wanna do is get the latest data of each rack in a given room then sort them by power.
with the code below I did something to get close to my target.losing mind with the last step which seems like soring my data cross different buckets by field 'power'.
GET /power/_search
{
"query": {
"term": {
"roomName.keyword": {
"value": "roomB"
}
}
},
"aggs": {
"rk_ag": {
"terms": {
"field": "rackName"
},
"aggs": {
"latest": {
"top_hits": {
"sort": [
{
"timestamp": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
-----------------------------------result-------------------------------------------------------
"aggregations" : {
"rk_ag" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "rack003",
"doc_count" : 4,
"latest" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "power",
"_type" : "_doc",
"_id" : "0FXVQnMB8DPB7H9t6U0E",
"_score" : null,
"_source" : {
"rackName" : "rack003",
"roomName" : "roomB",
"power" : 115,
"timestamp" : 1594540117492
},
"sort" : [
1594540117492
]
}
]
}
}
},
{
"key" : "rack004",
"doc_count" : 4,
"latest" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "power",
"_type" : "_doc",
"_id" : "1FXVQnMB8DPB7H9t6U0E",
"_score" : null,
"_source" : {
"rackName" : "rack004",
"roomName" : "roomB",
"power" : 108,
"timestamp" : 1594540117492
},
"sort" : [
1594540117492
]
}
]
}
}
},
{
"key" : "rack005",
"doc_count" : 4,
"latest" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "power",
"_type" : "_doc",
"_id" : "2FXVQnMB8DPB7H9t6U0E",
"_score" : null,
"_source" : {
"rackName" : "rack005",
"roomName" : "roomB",
"power" : 118,
"timestamp" : 1594540114492
},
"sort" : [
1594540114492
]
}
]
}
}
}
]
}
}
You're sorting by timestamp instead of power. Try this instead:
GET /power/_search
{
"query": {
"term": {
"roomName.keyword": {
"value": "roomB"
}
}
},
"aggs": {
"rk_ag": {
"terms": {
"field": "rackName"
},
"aggs": {
"latest": {
"top_hits": {
"sort": [
{
"power": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
You can sort by multiple fields too.
Adding to #Joe's answer. As he mentioned, you can use multiple fields in the sort.
Below query would give you what you are looking for:
POST my_rack_index/_search
{
"size": 0,
"query": {
"term": {
"roomName.keyword": {
"value": "roomB"
}
}
},
"aggs": {
"rk_ag": {
"terms": {
"field": "rackName"
},
"aggs": {
"latest": {
"top_hits": {
"sort": [ <---- Note this part
{
"timestamp": {
"order": "desc"
}
},
{
"power": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
So now if for every rack you have two documents having same rackName with exact same power, the one with the latest timestamp would be showing up in the response.
The way sort would work is, first it would sort based on the timestamp, then it would do the sorting based on power by keeping the sort based on timestamp intact.

Search documents with highest fields

I'm trying to get all the documents with highest field value (+ conditional term filter)
Given the Employees mapping
Name Department Salary
----------------------------
Tomcat Dev 100
Bobcat QA 90
Beast QA 100
Tom Dev 100
Bob Dev 90
In SQL it would look like
select * from Employees where Salary = select max(salary) from Employees
expected output
Name Department Salary
----------------------------
Tomcat Dev 100
Beast QA 100
Tom Dev 100
and
select * from Employees where Salary = (select max(salary) from Employees where Department ='Dev' )
expected output
Name Department Salary
----------------------------
Tomcat Dev 100
Tom Dev 100
Is it possible with Elasticsearch ?
The below should help:
Looking at your data, note that I've come up with the below mapping:
Mapping:
PUT my-salary-index
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"department":{
"type": "keyword"
},
"salary":{
"type": "float"
}
}
}
}
Sample Documents:
POST my-salary-index/_doc/1
{
"name": "Tomcat",
"department": "Dev",
"salary": 100
}
POST my-salary-index/_doc/2
{
"name": "Bobcast",
"department": "QA",
"salary": 90
}
POST my-salary-index/_doc/3
{
"name": "Beast",
"department": "QA",
"salary": 100
}
POST my-salary-index/_doc/4
{
"name": "Tom",
"department": "Dev",
"salary": 100
}
POST my-salary-index/_doc/5
{
"name": "Bob",
"department": "Dev",
"salary": 90
}
Solutions:
Scenario 1: Return all employees with max salary
POST my-salary-index/_search
{
"size": 0,
"aggs": {
"my_employees_salary":{
"terms": {
"field": "salary",
"size": 1, <--- Note this
"order": {
"_key": "desc"
}
},
"aggs": {
"my_employees": {
"top_hits": { <--- Note this. Top hits aggregation
"size": 10
}
}
}
}
}
}
Note that I've made use of Terms Aggregation with Top Hits aggregation chained to it. I'd suggest to go through the links to understand both the aggregations.
So basically you just need to retrieve the first element in the Terms Aggregation that is why I've mentioned the size: 1. Also note the order, just in case if you requirement to retrieve the lowest.
Scenario 1 Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_employees" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 2,
"buckets" : [
{
"key" : 100.0,
"doc_count" : 3,
"employees" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "Tomcat",
"department" : "Dev",
"salary" : 100
}
},
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "Beast",
"department" : "QA",
"salary" : 100
}
},
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"name" : "Tom",
"department" : "Dev",
"salary" : 100
}
}
]
}
}
}
]
}
}
}
Scenario 2: Return all employee with max salary from particular department
POST my-salary-index/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"department": "Dev"
}
}
]
}
},
"aggs": {
"my_employees_salary":{
"terms": {
"field": "salary",
"size": 1,
"order": {
"_key": "desc"
}
},
"aggs": {
"my_employees": {
"top_hits": {
"size": 10
}
}
}
}
}
}
For this, there are many ways to do this, but the idea is that you basically filter the documents before you apply aggregation on top of it. That way it would be more efficient.
Note that I'v just added a bool condition to the aggregation query mentioned in solution for Scenario 1.
Scenario 2 Response
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_employees_salary" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : 100.0,
"doc_count" : 2,
"my_employees" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.53899646,
"hits" : [
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.53899646,
"_source" : {
"name" : "Tomcat",
"department" : "Dev",
"salary" : 100
}
},
{
"_index" : "my-salary-index",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.53899646,
"_source" : {
"name" : "Tom",
"department" : "Dev",
"salary" : 100
}
}
]
}
}
}
]
}
}
}
You can also think of making use of SQL Access if you have complete xpack or rather licensed version of x-pack.
Hope this helps.

Elasticsearch, terms aggs according to sibling nested fields

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{
'_id' : 1001,
'title' : "Title 1",
'smp_id' : 1,
"latest" : [
{
"soc_mm_score" : "5",
}
]
},
{
'_id' : 1002,
'title' : "Title 2",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "10",
}
]
},
{
'_id' : 1003,
'title' : "Title 3",
'smp_id' : 3,
"latest" : [
{
"soc_mm_score" : "35",
}
]
},
{
'_id' : 1004,
'title' : "Title 4",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "30",
}
]
}
//omitted some other fields
influencers:
{
'_id' : 1,
'name' : "John",
'smp_id' : 1
},
{
'_id' : 2,
'name' : "Peter",
'smp_id' : 2
},
{
'_id' : 3,
'name' : "Mark",
'smp_id' : 3
}
Now I have this simple query that determines which documents in the socialmedia index has the most latest.soc_mm_score value, and also displaying their corresponding influencers determined by the smp_id
GET socialmedia/_search
{
"size": 0,
"_source": "latest",
"query": {
"match_all": {}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"MM_SCORE": {
"terms": {
"field": "latest.soc_mm_score",
"order": {
"_key": "desc"
},
"size": 3
},
"aggs": {
"REVERSE": {
"reverse_nested": {},
"aggs": {
"SMP_ID": {
"top_hits": {
"_source": ["smp_id"],
"size": 1
}
}
}
}
}
}
}
}
}
}
SAMPLE OUTPUT:
"aggregations" : {
"LATEST" : {
"doc_count" : //omitted,
"MM_SCORE" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
{
"key" : 35,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1003",
"_score" : 1.0,
"_source" : {
"smp_id" : "3"
}
}
]
}
}
}
},
{
"key" : 30,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1004",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
},
{
"key" : 10,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1002",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
}
]
}
}
}
with the query above, I was able to successfully display which documents have the highest latest.soc_mm_score values
The sample output above only displays DOCUMENTS, telling that the influencers (a.k.a smp_id) related to them are the TOP INFLUENCERS according to latest.soc_mm_score
Ideally just by using this aggs query,
"terms" : {
"field" : "smp_id"
}
portrays the concept of which influencers are the top according to the doc_count
Now, displaying the terms query according to latest.soc_mm_score displays TOP DOCUMENTS
"terms" : {
"field" : "latest.soc_mm_score"
}
REAL OBJECTIVE:
I want to display the TOP INFLUENCERS according to the latest.soc_mm_count in the socialmedia index. If Elasticsearch can count all the documents where according to unique smp_id, is there a way for ES to sum all latest.soc_mm_score values and use it as terms?
My objective above should output these:
smp_id 2 as the Top Influencer because he has 2 posts (with soc_mm_score of 30 and 10), adding them gets him 40 soc_mm_score
smp_id 3 as the 2nd Top Influencer, he has 1 post with 35 soc_mm_score
smp_id 1 as the 3rd Top Influencer, he has 1 post with 5 soc_mm_score
Is there a proper query to meet this objective?
FINALLY! FOUND AN ANSWER!!!
"aggs": {
"INFS": {
"terms": {
"field": "smp_id.keyword",
"order": {
"LATEST > SUM_SVALUE": "desc"
}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"SUM_SVALUE": {
"sum" : {
"field": "latest.soc_mm_score"
}
}
}
}
}
}
}
Displays the following sample:

Resources