Elasticsearch DSL query - elasticsearch

{
"name" : "Danny",
"id" : "123",
"lastProfileUpdateTime" : "2021-06-26T20:08:25.089Z"
},
{
"name" : "Harry",
"id" : "124",
"lastProfileUpdateTime" : "2021-04-12T20:08:25.089Z"
},
{
"name" : "Danny Brown",
"id" : "123",
"lastProfileUpdateTime" : "2021-07-26T20:08:25.089Z"
},
{
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
}
I have a usecase where i need to find if a particular id has been updated or not and filter out latest profile. In above case since id:123 has updated profile. so expected outcome should be:
{
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
}
If a id has more than one entry, pick the one which has latest lastProfileUpdateTime.

This can be done using
Terms aggregation
: to group by "id"
Bucket selector: to get ids where count > 1
Top_hits: To get latest document for given id
Query
{
"size": 0,
"aggs": {
"Updated_docs": {
"terms": {
"field": "id.keyword",
"size": 10
},
"aggs": {
"filter_count_more_than_one": {
"bucket_selector": {
"buckets_path": {
"count": "_count"
},
"script": "params.count>1"
}
},
"latest_document": {
"top_hits": {
"size": 1,
"sort": [
{
"lastProfileUpdateTime": "desc"
}
]
}
}
}
}
}
}
Result
"aggregations" : {
"Updated_docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "123",
"doc_count" : 3,
"latest_document" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "index4",
"_type" : "_doc",
"_id" : "huE1i3sBZ5_aIkj8cvpR",
"_score" : null,
"_source" : {
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
},
"sort" : [
1630008505089
]
}
]
}
}
}
]
}
}

Related

How to return hit term in ES ?

I try to return only the terms that were successfully hit instead of the document itself, but I don’t know how to achieve the desired effect。
"es_episode" : {
"aliases" : { },
"mappings" : {
"properties" : {
"endTime" : {
"type" : "long"
},
"episodeId" : {
"type" : "long"
},
"startTime" : {
"type" : "long"
},
"studentIds" : {
"type" : "long"
}
}
}
This is an example:
{
"episodeId":124,
"startTime":10,
"endTime":20,
"studentIds":[200,300]
}
My query:
GET /es_episode/_search
{
"_source": ["studentIds"],
"query": {
"terms": {
"studentIds": [300,400]
}
}
}
The result is
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "es_episode",
"_type" : "episode",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"studentIds" : [
200,
300
]
}
}
]
}
But in fact I only want to know which term hits. For example, the result I want should be studentIds=[300] instead of all studentIds=[200,300] of the returned document. It seems that some additional operations are required, but I don’t know
how.
I try to achieve my goal with the following query
GET /es_episode/_search
{
"_source": ["studentIds"],
"query": {
"terms": {
"studentIds": [300,400]
}
},
"aggs": {
"student_id": {
"terms": {
"field": "studentIds",
"size": 10
},
"aggs": {
"id": {
"terms": {
"field": "episodeId"
}
},
"id_select":{
"bucket_selector": {
"buckets_path": {
"key" : "_key"
},
"script": "params.key==300 || params.key==400"
}
}
}
}
}
}
the result for this is
"aggregations" : {
"student_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 300,
"doc_count" : 1,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 124,
"doc_count" : 1
}
]
}
}
]
}
}
It seems that I successfully filtered out the terms I don’t want, but this doesn’t look pretty, and I need to set my parameters repeatedly in the script

global sorting across different buckets after aggregation in elasticsearch

a sample in my document is as shown below.
{"rackName" : "rack005", "roomName" : "roomB", "power" : 132, "timestamp" : 1594540106208}
the thing I wanna do is get the latest data of each rack in a given room then sort them by power.
with the code below I did something to get close to my target.losing mind with the last step which seems like soring my data cross different buckets by field 'power'.
GET /power/_search
{
"query": {
"term": {
"roomName.keyword": {
"value": "roomB"
}
}
},
"aggs": {
"rk_ag": {
"terms": {
"field": "rackName"
},
"aggs": {
"latest": {
"top_hits": {
"sort": [
{
"timestamp": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
-----------------------------------result-------------------------------------------------------
"aggregations" : {
"rk_ag" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "rack003",
"doc_count" : 4,
"latest" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "power",
"_type" : "_doc",
"_id" : "0FXVQnMB8DPB7H9t6U0E",
"_score" : null,
"_source" : {
"rackName" : "rack003",
"roomName" : "roomB",
"power" : 115,
"timestamp" : 1594540117492
},
"sort" : [
1594540117492
]
}
]
}
}
},
{
"key" : "rack004",
"doc_count" : 4,
"latest" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "power",
"_type" : "_doc",
"_id" : "1FXVQnMB8DPB7H9t6U0E",
"_score" : null,
"_source" : {
"rackName" : "rack004",
"roomName" : "roomB",
"power" : 108,
"timestamp" : 1594540117492
},
"sort" : [
1594540117492
]
}
]
}
}
},
{
"key" : "rack005",
"doc_count" : 4,
"latest" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "power",
"_type" : "_doc",
"_id" : "2FXVQnMB8DPB7H9t6U0E",
"_score" : null,
"_source" : {
"rackName" : "rack005",
"roomName" : "roomB",
"power" : 118,
"timestamp" : 1594540114492
},
"sort" : [
1594540114492
]
}
]
}
}
}
]
}
}
You're sorting by timestamp instead of power. Try this instead:
GET /power/_search
{
"query": {
"term": {
"roomName.keyword": {
"value": "roomB"
}
}
},
"aggs": {
"rk_ag": {
"terms": {
"field": "rackName"
},
"aggs": {
"latest": {
"top_hits": {
"sort": [
{
"power": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
You can sort by multiple fields too.
Adding to #Joe's answer. As he mentioned, you can use multiple fields in the sort.
Below query would give you what you are looking for:
POST my_rack_index/_search
{
"size": 0,
"query": {
"term": {
"roomName.keyword": {
"value": "roomB"
}
}
},
"aggs": {
"rk_ag": {
"terms": {
"field": "rackName"
},
"aggs": {
"latest": {
"top_hits": {
"sort": [ <---- Note this part
{
"timestamp": {
"order": "desc"
}
},
{
"power": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
So now if for every rack you have two documents having same rackName with exact same power, the one with the latest timestamp would be showing up in the response.
The way sort would work is, first it would sort based on the timestamp, then it would do the sorting based on power by keeping the sort based on timestamp intact.

How to get last and first document ids by given criteria

I have some documents indexed on Elasticsearch, looking like these samples:
{"id":"1","isMigrated":true}
{"id":"2","isMigrated":true}
{"id":"3","isMigrated":false}
{"id":"4","isMigrated":false}
how can i get in one query the last migrated id and first not migrated id?
Any ideas?
Filter aggregation and top_hits aggregation can be used to get last migrated and first not migrated
{
"size": 0,
"aggs": {
"migrated": {
"filter": { --> filter where isMigrated:true
"term": {
"isMigrated": true
}
},
"aggs": {
"last_migrated": { --> get first documents sorted on id in descending order
"top_hits": {
"size": 1,
"sort": [{"id.keyword":"desc"}]
}
}
}
},
"not_migrated": {
"filter": {
"term": {
"isMigrated": false
}
},
"aggs": {
"first_not_migrated": {
"top_hits": {
"size": 1,
"sort": [{"id.keyword":"asc"}] -->any keyword field can be used to sort
}
}
}
}
}
}
Result:
"aggregations" : {
"not_migrated" : {
"doc_count" : 2,
"first_not_migrated" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "index86",
"_type" : "_doc",
"_id" : "TxuKUHIB8mx5yKbJ_rGH",
"_score" : null,
"_source" : {
"id" : "3",
"isMigrated" : false
},
"sort" : [
"3"
]
}
]
}
}
},
"migrated" : {
"doc_count" : 2,
"last_migrated" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "index86",
"_type" : "_doc",
"_id" : "ThuKUHIB8mx5yKbJ87HF",
"_score" : null,
"_source" : {
"id" : "2",
"isMigrated" : true
},
"sort" : [
"2"
]
}
]
}
}
}
}
You can store the timestamp information with each document and query based on the latest timestamp and isMigrated: true condition.
As per comment, https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html can you used to combine multiple boolean conditions.

Elastic Search Intersection Query

I want to fetch common words of list of users sorted by total count.
example:
I have a index of words used by a user.
docs:
[
{
user_id: 1,
word: 'food',
count: 2
},
{
user_id: 1,
word: 'thor',
count: 1
},
{
user_id: 1,
word: 'beer',
count: 7
},
{
user_id: 2,
word: 'summer',
count: 12
},
{
user_id: 2,
word: 'thor',
count: 4
},
{
user_id: 1,
word: 'beer',
count: 2
},
..otheruserdetails..
]
input: user_ids: [1, 2]
desired output:
[
{
'word': 'beer',
'total_count': 9
},
{
'word': 'thor',
'total_count': 5
}
]
what I have so far:
fetch all docs using user_id in user_id list (bool should query)
process docs in app layer.
loop through each keyword
check if keyword is present for each user_id
if yes, find count
else, dispose and go to next keyword
However, this is not feasible because word docs are gonna grow huge and app layer won't keep-up. any way to move this to ES query?
You can use Terms aggregation and Value Count aggregation
One can look at "Terms aggregation" as a "Group By". Output will give a unique list of userIds, list of all words under user and finally count of each word
{
"from": 0,
"size": 10,
"query": {
"terms": {
"user_id": [
"1",
"2"
]
}
},
"aggs": {
"users": {
"terms": {
"field": "user_id",
"size": 10
},
"aggs": {
"words": {
"terms": {
"field": "word.keyword",
"size": 10
},
"aggs": {
"word_count": {
"value_count": {
"field": "word.keyword"
}
}
}
}
}
}
}
}
Result
"hits" : [
{
"_index" : "index89",
"_type" : "_doc",
"_id" : "gFRzr3ABAWOsYG7t2tpt",
"_score" : 1.0,
"_source" : {
"user_id" : 1,
"word" : "thor",
"count" : 1
}
},
{
"_index" : "index89",
"_type" : "_doc",
"_id" : "flRzr3ABAWOsYG7t0dqI",
"_score" : 1.0,
"_source" : {
"user_id" : 1,
"word" : "food",
"count" : 2
}
},
{
"_index" : "index89",
"_type" : "_doc",
"_id" : "f1Rzr3ABAWOsYG7t19ps",
"_score" : 1.0,
"_source" : {
"user_id" : 2,
"word" : "thor",
"count" : 4
}
},
{
"_index" : "index89",
"_type" : "_doc",
"_id" : "gVRzr3ABAWOsYG7t8NrR",
"_score" : 1.0,
"_source" : {
"user_id" : 1,
"word" : "food",
"count" : 2
}
},
{
"_index" : "index89",
"_type" : "_doc",
"_id" : "glRzr3ABAWOsYG7t-Npj",
"_score" : 1.0,
"_source" : {
"user_id" : 1,
"word" : "thor",
"count" : 1
}
},
{
"_index" : "index89",
"_type" : "_doc",
"_id" : "g1Rzr3ABAWOsYG7t_9po",
"_score" : 1.0,
"_source" : {
"user_id" : 2,
"word" : "thor",
"count" : 4
}
}
]
},
"aggregations" : {
"users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 4,
"words" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "food",
"doc_count" : 2,
"word_count" : {
"value" : 2
}
},
{
"key" : "thor",
"doc_count" : 2,
"word_count" : {
"value" : 2
}
}
]
}
},
{
"key" : 2,
"doc_count" : 2,
"words" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "thor",
"doc_count" : 2,
"word_count" : {
"value" : 2
}
}
]
}
}
]
}
}
You can use aggregations along with filter for the user like below:
{
"size": 0,
"aggs": {
"words_stats": {
"filter": {
"terms": {
"user_id": [
"1",
"2"
]
}
},
"aggs": {
"words": {
"terms": {
"field": "word.keyword"
},
"aggs": {
"total_count": {
"sum": {
"field": "count"
}
}
}
}
}
}
}
}
The results will be:
{
"key" : "beer",
"doc_count" : 2,
"total_count" : {
"value" : 9.0
}
},
{
"key" : "thor",
"doc_count" : 2,
"total_count" : {
"value" : 5.0
}
},
{
"key" : "food",
"doc_count" : 1,
"total_count" : {
"value" : 2.0
}
},
{
"key" : "summer",
"doc_count" : 1,
"total_count" : {
"value" : 12.0
}
}
Here is what I had to do:
I have referred to #Rakesh Chandru & #jaspreet chahal's answers' and came up with this. this query handles intersection and sorting.
Process:
filter by user_ids
group_by(terms aggs) on keyword (word in example),
order by aggregating (sum) counts
{
size: 0, // because we do not want result of filtered records
query: {
terms: { user_id: user_ids } // filter by user_ids
},
aggs: {
group_by_keyword: {
terms: {
field: "keyword", // group by keyword
min_doc_count: 2, // where count >= 2
order: { agg_count: "desc" }, // order by count
size
},
aggs: {
agg_count: {
sum: {
field: "count" // aggregating count
}
}
}
}
}
}

Elasticsearch, terms aggs according to sibling nested fields

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{
'_id' : 1001,
'title' : "Title 1",
'smp_id' : 1,
"latest" : [
{
"soc_mm_score" : "5",
}
]
},
{
'_id' : 1002,
'title' : "Title 2",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "10",
}
]
},
{
'_id' : 1003,
'title' : "Title 3",
'smp_id' : 3,
"latest" : [
{
"soc_mm_score" : "35",
}
]
},
{
'_id' : 1004,
'title' : "Title 4",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "30",
}
]
}
//omitted some other fields
influencers:
{
'_id' : 1,
'name' : "John",
'smp_id' : 1
},
{
'_id' : 2,
'name' : "Peter",
'smp_id' : 2
},
{
'_id' : 3,
'name' : "Mark",
'smp_id' : 3
}
Now I have this simple query that determines which documents in the socialmedia index has the most latest.soc_mm_score value, and also displaying their corresponding influencers determined by the smp_id
GET socialmedia/_search
{
"size": 0,
"_source": "latest",
"query": {
"match_all": {}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"MM_SCORE": {
"terms": {
"field": "latest.soc_mm_score",
"order": {
"_key": "desc"
},
"size": 3
},
"aggs": {
"REVERSE": {
"reverse_nested": {},
"aggs": {
"SMP_ID": {
"top_hits": {
"_source": ["smp_id"],
"size": 1
}
}
}
}
}
}
}
}
}
}
SAMPLE OUTPUT:
"aggregations" : {
"LATEST" : {
"doc_count" : //omitted,
"MM_SCORE" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
{
"key" : 35,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1003",
"_score" : 1.0,
"_source" : {
"smp_id" : "3"
}
}
]
}
}
}
},
{
"key" : 30,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1004",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
},
{
"key" : 10,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1002",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
}
]
}
}
}
with the query above, I was able to successfully display which documents have the highest latest.soc_mm_score values
The sample output above only displays DOCUMENTS, telling that the influencers (a.k.a smp_id) related to them are the TOP INFLUENCERS according to latest.soc_mm_score
Ideally just by using this aggs query,
"terms" : {
"field" : "smp_id"
}
portrays the concept of which influencers are the top according to the doc_count
Now, displaying the terms query according to latest.soc_mm_score displays TOP DOCUMENTS
"terms" : {
"field" : "latest.soc_mm_score"
}
REAL OBJECTIVE:
I want to display the TOP INFLUENCERS according to the latest.soc_mm_count in the socialmedia index. If Elasticsearch can count all the documents where according to unique smp_id, is there a way for ES to sum all latest.soc_mm_score values and use it as terms?
My objective above should output these:
smp_id 2 as the Top Influencer because he has 2 posts (with soc_mm_score of 30 and 10), adding them gets him 40 soc_mm_score
smp_id 3 as the 2nd Top Influencer, he has 1 post with 35 soc_mm_score
smp_id 1 as the 3rd Top Influencer, he has 1 post with 5 soc_mm_score
Is there a proper query to meet this objective?
FINALLY! FOUND AN ANSWER!!!
"aggs": {
"INFS": {
"terms": {
"field": "smp_id.keyword",
"order": {
"LATEST > SUM_SVALUE": "desc"
}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"SUM_SVALUE": {
"sum" : {
"field": "latest.soc_mm_score"
}
}
}
}
}
}
}
Displays the following sample:

Resources