Elastic Search Intersection Query - elasticsearch

I want to fetch the words common to a list of users, sorted by total count.
I have an index of the words used by each user:
{ user_id: 1, word: 'food', count: 2 }
{ user_id: 1, word: 'thor', count: 1 }
{ user_id: 1, word: 'beer', count: 7 }
{ user_id: 2, word: 'summer', count: 12 }
{ user_id: 2, word: 'thor', count: 4 }
{ user_id: 2, word: 'beer', count: 2 }
Input: user_ids: [1, 2]
Desired output:
{ 'word': 'beer', 'total_count': 9 }
{ 'word': 'thor', 'total_count': 5 }
What I have so far:
Fetch all docs whose user_id is in the user_id list (bool should query).
Process the docs in the app layer:
  loop through each keyword
  check whether the keyword is present for every user_id
  if yes, sum its count
  else, discard it and go to the next keyword
However, this is not feasible, because the number of word docs is going to grow huge and the app layer won't keep up. Is there any way to move this into the ES query?

You can use a terms aggregation together with a value_count aggregation.
A terms aggregation can be thought of as a "group by". The output gives the unique user_ids, the words under each user, and the number of documents for each word.
{
  "from": 0,
  "size": 10,
  "query": {
    "terms": { "user_id": [1, 2] }
  },
  "aggs": {
    "users": {
      "terms": { "field": "user_id", "size": 10 },
      "aggs": {
        "words": {
          "terms": { "field": "word.keyword", "size": 10 },
          "aggs": {
            "word_count": {
              "value_count": { "field": "word.keyword" }
            }
          }
        }
      }
    }
  }
}
"hits" : [
  { "_index" : "index89", "_type" : "_doc", "_id" : "gFRzr3ABAWOsYG7t2tpt", "_score" : 1.0, "_source" : { "user_id" : 1, "word" : "thor", "count" : 1 } },
  { "_index" : "index89", "_type" : "_doc", "_id" : "flRzr3ABAWOsYG7t0dqI", "_score" : 1.0, "_source" : { "user_id" : 1, "word" : "food", "count" : 2 } },
  { "_index" : "index89", "_type" : "_doc", "_id" : "f1Rzr3ABAWOsYG7t19ps", "_score" : 1.0, "_source" : { "user_id" : 2, "word" : "thor", "count" : 4 } },
  { "_index" : "index89", "_type" : "_doc", "_id" : "gVRzr3ABAWOsYG7t8NrR", "_score" : 1.0, "_source" : { "user_id" : 1, "word" : "food", "count" : 2 } },
  { "_index" : "index89", "_type" : "_doc", "_id" : "glRzr3ABAWOsYG7t-Npj", "_score" : 1.0, "_source" : { "user_id" : 1, "word" : "thor", "count" : 1 } },
  { "_index" : "index89", "_type" : "_doc", "_id" : "g1Rzr3ABAWOsYG7t_9po", "_score" : 1.0, "_source" : { "user_id" : 2, "word" : "thor", "count" : 4 } }
],
"aggregations" : {
  "users" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 1,
        "doc_count" : 4,
        "words" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            { "key" : "food", "doc_count" : 2, "word_count" : { "value" : 2 } },
            { "key" : "thor", "doc_count" : 2, "word_count" : { "value" : 2 } }
          ]
        }
      },
      {
        "key" : 2,
        "doc_count" : 2,
        "words" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            { "key" : "thor", "doc_count" : 2, "word_count" : { "value" : 2 } }
          ]
        }
      }
    ]
  }
}

You can use aggregations along with a filter on the users, like below:
{
  "size": 0,
  "aggs": {
    "words_stats": {
      "filter": {
        "terms": { "user_id": [1, 2] }
      },
      "aggs": {
        "words": {
          "terms": { "field": "word.keyword" },
          "aggs": {
            "total_count": {
              "sum": { "field": "count" }
            }
          }
        }
      }
    }
  }
}
The results will be:
{ "key" : "beer", "doc_count" : 2, "total_count" : { "value" : 9.0 } },
{ "key" : "thor", "doc_count" : 2, "total_count" : { "value" : 5.0 } },
{ "key" : "food", "doc_count" : 1, "total_count" : { "value" : 2.0 } },
{ "key" : "summer", "doc_count" : 1, "total_count" : { "value" : 12.0 } }

Here is what I had to do:
Building on the answers from @Rakesh Chandru and @jaspreet chahal, I came up with the following. This query handles both the intersection and the sorting:
filter by user_ids,
group by (terms aggs) on keyword (word in the example),
order by the aggregated (sum) counts.
{
  size: 0, // because we do not want the filtered records themselves
  query: {
    terms: { user_id: user_ids } // filter by user_ids
  },
  aggs: {
    group_by_keyword: {
      terms: {
        field: "keyword", // group by keyword
        min_doc_count: 2, // keep keywords with doc_count >= 2
        order: { agg_count: "desc" } // order by summed count
      },
      aggs: {
        agg_count: {
          sum: { field: "count" } // aggregating count
        }
      }
    }
  }
}
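The hard-coded min_doc_count of 2 only matches the two-user example. A minimal Python sketch of one way to generalise it follows, assuming each (user_id, word) pair is stored in at most one document, so that a word's doc_count equals the number of users who used it. The function names and the default field name `word.keyword` are hypothetical, and `common_words` is only a plain-Python reference for checking results, not something Elasticsearch provides.

```python
from collections import defaultdict


def build_common_words_query(user_ids, word_field="word.keyword"):
    """Build a search body for words shared by all the given users.

    Assumes one document per (user_id, word), so doc_count == number of users.
    """
    return {
        "size": 0,
        "query": {"terms": {"user_id": list(user_ids)}},
        "aggs": {
            "group_by_keyword": {
                "terms": {
                    "field": word_field,
                    # a word must appear once per requested user
                    "min_doc_count": len(set(user_ids)),
                    "order": {"agg_count": "desc"},
                },
                "aggs": {"agg_count": {"sum": {"field": "count"}}},
            }
        },
    }


def common_words(docs, user_ids):
    """Plain-Python reference: words used by every user, with summed counts."""
    wanted = set(user_ids)
    users_per_word = defaultdict(set)
    totals = defaultdict(int)
    for doc in docs:
        if doc["user_id"] in wanted:
            users_per_word[doc["word"]].add(doc["user_id"])
            totals[doc["word"]] += doc["count"]
    shared = [(w, totals[w]) for w, users in users_per_word.items() if users == wanted]
    return sorted(shared, key=lambda pair: pair[1], reverse=True)
```

The reference function counts distinct users rather than documents, so it stays correct even if a user has duplicate documents for a word; the query does not, which is why the one-document-per-pair assumption matters.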


Elasticsearch DSL query

{ "name" : "Danny", "id" : "123", "lastProfileUpdateTime" : "2021-06-26T20:08:25.089Z" }
{ "name" : "Harry", "id" : "124", "lastProfileUpdateTime" : "2021-04-12T20:08:25.089Z" }
{ "name" : "Danny Brown", "id" : "123", "lastProfileUpdateTime" : "2021-07-26T20:08:25.089Z" }
{ "name" : "Danny Smith", "id" : "123", "lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z" }
I have a use case where I need to find whether a particular id has been updated and, if so, return only its latest profile. In the case above, id 123 has an updated profile, so the expected outcome is:
{ "name" : "Danny Smith", "id" : "123", "lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z" }
If an id has more than one entry, pick the one with the latest lastProfileUpdateTime.
This can be done using:
Terms aggregation: to group by "id"
Bucket selector: to get ids where count > 1
Top_hits: to get the latest document for a given id
{
  "size": 0,
  "aggs": {
    "Updated_docs": {
      "terms": { "field": "id.keyword", "size": 10 },
      "aggs": {
        "filter_count_more_than_one": {
          "bucket_selector": {
            "buckets_path": { "count": "_count" },
            "script": "params.count>1"
          }
        },
        "latest_document": {
          "top_hits": {
            "size": 1,
            "sort": [ { "lastProfileUpdateTime": "desc" } ]
          }
        }
      }
    }
  }
}
"aggregations" : {
"Updated_docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
"key" : "123",
"doc_count" : 3,
"latest_document" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
"max_score" : null,
"hits" : [
"_index" : "index4",
"_type" : "_doc",
"_id" : "huE1i3sBZ5_aIkj8cvpR",
"_score" : null,
"_source" : {
"name" : "Danny Smith",
"id" : "123",
"lastProfileUpdateTime" : "2021-08-26T20:08:25.089Z"
"sort" : [

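To make the logic of that answer concrete, here is a small plain-Python sketch of the same three steps (group by id, keep groups with more than one document, take the newest document per group). It does not call Elasticsearch; the function name `latest_updated_profiles` is hypothetical, and it assumes every timestamp uses the format shown in the sample documents.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed format, taken from the sample docs


def latest_updated_profiles(docs):
    """Return {id: newest doc} for ids that have more than one document."""
    groups = defaultdict(list)
    for doc in docs:  # step 1: group by id (terms aggregation)
        groups[doc["id"]].append(doc)

    def updated_at(doc):
        return datetime.strptime(doc["lastProfileUpdateTime"], TIME_FORMAT)

    return {
        key: max(group, key=updated_at)  # step 3: newest doc (top_hits, size 1)
        for key, group in groups.items()
        if len(group) > 1  # step 2: count > 1 (bucket_selector)
    }
```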
Elasticsearch, terms aggs according to sibling nested fields

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{ '_id' : 1001, 'title' : "Title 1", 'smp_id' : 1, "latest" : [ { "soc_mm_score" : "5" } ] }
{ '_id' : 1002, 'title' : "Title 2", 'smp_id' : 2, "latest" : [ { "soc_mm_score" : "10" } ] }
{ '_id' : 1003, 'title' : "Title 3", 'smp_id' : 3, "latest" : [ { "soc_mm_score" : "35" } ] }
{ '_id' : 1004, 'title' : "Title 4", 'smp_id' : 2, "latest" : [ { "soc_mm_score" : "30" } ] }
//omitted some other fields
influencers:
{ '_id' : 1, 'name' : "John", 'smp_id' : 1 }
{ '_id' : 2, 'name' : "Peter", 'smp_id' : 2 }
{ '_id' : 3, 'name' : "Mark", 'smp_id' : 3 }
Now I have this simple query that determines which documents in the socialmedia index have the highest latest.soc_mm_score values, and also displays their corresponding influencers as determined by smp_id:
GET socialmedia/_search
"size": 0,
"_source": "latest",
"query": {
"match_all": {}
"aggs": {
"nested": {
"path": "latest"
"aggs": {
"terms": {
"field": "latest.soc_mm_score",
"order": {
"_key": "desc"
"size": 3
"aggs": {
"reverse_nested": {},
"aggs": {
"SMP_ID": {
"top_hits": {
"_source": ["smp_id"],
"size": 1
"aggregations" : {
"LATEST" : {
"doc_count" : //omitted,
"MM_SCORE" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
"key" : 35,
"doc_count" : 1,
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
"max_score" : 1.0,
"hits" : [
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1003",
"_score" : 1.0,
"_source" : {
"smp_id" : "3"
"key" : 30,
"doc_count" : 1,
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
"max_score" : 1.0,
"hits" : [
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1004",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
"key" : 10,
"doc_count" : 1,
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
"max_score" : 1.0,
"hits" : [
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1002",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
With the query above, I was able to display which documents have the highest latest.soc_mm_score values.
The sample output above only lists DOCUMENTS, implying that the influencers (a.k.a. smp_id) related to them are the TOP INFLUENCERS according to latest.soc_mm_score.
Ideally, just this aggs query,
"terms" : {
  "field" : "smp_id"
}
ranks the influencers by doc_count.
In contrast, a terms aggregation on latest.soc_mm_score displays the TOP DOCUMENTS:
"terms" : {
  "field" : "latest.soc_mm_score"
}
I want to display the TOP INFLUENCERS according to latest.soc_mm_score in the socialmedia index. Since Elasticsearch can count all the documents for each unique smp_id, is there a way for ES to sum all latest.soc_mm_score values per smp_id and order the terms by that sum?
My objective should produce this output:
smp_id 2 as the top influencer, because he has 2 posts (with soc_mm_score of 30 and 10), which add up to a soc_mm_score of 40
smp_id 3 as the 2nd top influencer, with 1 post of 35 soc_mm_score
smp_id 1 as the 3rd top influencer, with 1 post of 5 soc_mm_score
Is there a proper query to meet this objective?
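To pin down the ordering being asked for, here is a plain-Python sketch that sums latest.soc_mm_score per smp_id and sorts by that sum. It is only a reference for the expected ranking, not an Elasticsearch query; the function name `rank_influencers` is hypothetical, and it assumes the scores can be converted with int(), since the sample _source stores them as strings.

```python
from collections import defaultdict


def rank_influencers(posts):
    """Return [(smp_id, summed score)] ordered from highest to lowest sum."""
    totals = defaultdict(int)
    for post in posts:
        for entry in post["latest"]:  # "latest" is a list of nested objects
            totals[post["smp_id"]] += int(entry["soc_mm_score"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```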
"aggs": {
"INFS": {
"terms": {
"field": "smp_id.keyword",
"order": {
"aggs": {
"nested": {
"path": "latest"
"aggs": {
"sum" : {
"field": "latest.soc_mm_score"
Displays the following sample:


This is my first post. I spent the weekend looking for an answer without a good result.
I will try to explain my issue. I have this index:
ST ID
0 1
1 1
0 2
1 2
0 2
1 3
0 3
For each ID, I need to show the last record only when its ST is 0. In this index I should get only ID 1 and ID 2, because the last record for each of them has ST set to 0.
Could someone help me with this issue?
PUT index34
{
  "mappings": {
    "properties": {
      "ST": { "type": "integer" },
      "ID": { "type": "integer" },
      "Date": { "type": "date" }
    }
  }
}
{ "_index" : "index34", "_type" : "_doc", "_id" : "LO7Z7W0B_-hMjUaqtwHw", "_score" : 1.0, "_source" : { "ST" : 1, "ID" : 1, "Date" : "2019-10-21T12:00:00Z" } },
{ "_index" : "index34", "_type" : "_doc", "_id" : "Le7Z7W0B_-hMjUaq0QEz", "_score" : 1.0, "_source" : { "ST" : 0, "ID" : 1, "Date" : "2019-10-21T12:01:00Z" } },
{ "_index" : "index34", "_type" : "_doc", "_id" : "Lu7a7W0B_-hMjUaqAwE0", "_score" : 1.0, "_source" : { "ST" : 1, "ID" : 2, "Date" : "2019-10-21T12:02:00Z" } },
{ "_index" : "index34", "_type" : "_doc", "_id" : "L-7a7W0B_-hMjUaqGAEr", "_score" : 1.0, "_source" : { "ST" : 0, "ID" : 2, "Date" : "2019-10-21T12:04:00Z" } },
{ "_index" : "index34", "_type" : "_doc", "_id" : "MO7a7W0B_-hMjUaqNAGA", "_score" : 1.0, "_source" : { "ST" : 0, "ID" : 3, "Date" : "2019-10-21T12:04:00Z" } },
{ "_index" : "index34", "_type" : "_doc", "_id" : "Me7a7W0B_-hMjUaqTQFP", "_score" : 1.0, "_source" : { "ST" : 1, "ID" : 3, "Date" : "2019-10-21T12:06:00Z" } }
Query: For each ID, I get the max date over all documents, and then the max date among the documents where ST is 0. If the two match (which means the ST = 0 document is the latest one), I keep that bucket.
GET index34/_search
{
  "size": 0,
  "aggs": {
    "ID": {
      "terms": { "field": "ID", "size": 10000 },
      "aggs": {
        "maxDate": {
          "max": { "field": "Date" }
        },
        "pending_status": {
          "filter": {
            "term": { "ST": 0 }
          },
          "aggs": {
            "filtered_maxdate": {
              "max": { "field": "Date" }
            }
          }
        },
        "buckets_latest_status_pending": {
          "bucket_selector": {
            "buckets_path": {
              "filtereddate": "pending_status>filtered_maxdate",
              "maxDate": "maxDate"
            },
            "script": "params.filtereddate==params.maxDate"
          }
        }
      }
    }
  }
}
"aggregations" : {
  "ID" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 1,
        "doc_count" : 2,
        "pending_status" : {
          "doc_count" : 1,
          "filtered_maxdate" : { "value" : 1.57165926E12, "value_as_string" : "2019-10-21T12:01:00.000Z" }
        },
        "maxDate" : { "value" : 1.57165926E12, "value_as_string" : "2019-10-21T12:01:00.000Z" }
      },
      {
        "key" : 2,
        "doc_count" : 2,
        "pending_status" : {
          "doc_count" : 1,
          "filtered_maxdate" : { "value" : 1.57165944E12, "value_as_string" : "2019-10-21T12:04:00.000Z" }
        },
        "maxDate" : { "value" : 1.57165944E12, "value_as_string" : "2019-10-21T12:04:00.000Z" }
      }
    ]
  }
}

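The comparison done by the bucket selector in the query above can be sketched in plain Python as follows. This is only an illustration of the logic on the six sample documents, not an Elasticsearch call; the function name `ids_with_latest_status` is hypothetical, and it assumes all dates are ISO-8601 strings in the same format, so that string comparison matches chronological order.

```python
def ids_with_latest_status(docs, status=0):
    """Return the IDs whose most recent document has the given ST value."""
    max_date = {}              # per ID: latest date over all docs ("maxDate")
    max_date_with_status = {}  # per ID: latest date where ST == status ("filtered_maxdate")
    for doc in docs:
        key, date = doc["ID"], doc["Date"]
        if key not in max_date or date > max_date[key]:
            max_date[key] = date
        if doc["ST"] == status:
            if key not in max_date_with_status or date > max_date_with_status[key]:
                max_date_with_status[key] = date
    # keep an ID only when both dates match, as the bucket_selector script does
    return sorted(k for k, d in max_date.items() if max_date_with_status.get(k) == d)
```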
elastic query to get events where corresponding pair is missing

I have transaction records that follow this lifecycle:
Event when the transaction is received [RCVD]
Event when the transaction becomes pending for execution [PNDG] (OPTIONAL step)
Event when it gets executed [SENT]
Following are the 7 sample events in the index:
{trxID: 1, status:RCVD}
{trxID: 2, status:RCVD}
{trxID: 3, status:RCVD}
{trxID: 2, status:PNDG}
{trxID: 3, status:PNDG}
{trxID: 1, status:SENT}
{trxID: 2, status:SENT}
I need to find all the transactions that went to the pending state but have not been executed yet. In other words, the transaction should have a PNDG status but no SENT status.
I am trying not to do this in the Java layer.
I did an aggregation on trxID, and then I did sub aggregation on status.
But I cannot figure out how to get only those records where the bucket has PNDG but no SENT in the sub-aggregation. I am not sure if I am thinking in the right direction.
The result I am expecting is trxID 3, because for this transaction we got the PNDG status but have not got SENT yet. On the other hand, trxID 1 should not be reported, as it never went to the PNDG (pending) state, irrespective of whether the SENT status is reported or not.
You can use the count of distinct statuses under each transaction id.
GET index24/_search
{
  "size": 0,
  "aggs": {
    "transactionId": {
      "terms": { "field": "trxID", "size": 10 },
      "aggs": {
        "status": {
          "terms": { "field": "status.keyword", "size": 10 }
        },
        "count": {
          "cardinality": { "field": "status.keyword" }
        },
        "my_bucketselector": {
          "bucket_selector": {
            "buckets_path": { "statusCount": "count" },
            "script": "params.statusCount==1"
          }
        }
      }
    }
  }
}
"aggregations" : {
  "transactionId" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 4,
        "doc_count" : 1,
        "count" : { "value" : 1 },
        "status" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            { "key" : "PNDG", "doc_count" : 1 }
          ]
        }
      }
    ]
  }
}
I have tried the approach below:
Get the max date for a transaction id, and then the max date among its pending documents. If both dates are the same, then pending is the last status.
{ "_index" : "index24", "_type" : "_doc", "_id" : "aYCs0m0BD5PlkoxXxO36", "_score" : 1.0, "_source" : { "trxID" : 1, "status" : "RCVD", "date" : "2019-10-15T12:00:00" } },
{ "_index" : "index24", "_type" : "_doc", "_id" : "aoCs0m0BD5PlkoxX7e35", "_score" : 1.0, "_source" : { "trxID" : 1, "status" : "PNDG", "date" : "2019-10-15T12:01:00" } },
{ "_index" : "index24", "_type" : "_doc", "_id" : "a4Ct0m0BD5PlkoxXCO06", "_score" : 1.0, "_source" : { "trxID" : 1, "status" : "SENT", "date" : "2019-10-15T12:02:00" } },
{ "_index" : "index24", "_type" : "_doc", "_id" : "bICt0m0BD5PlkoxXQe0Y", "_score" : 1.0, "_source" : { "trxID" : 2, "status" : "RCVD", "date" : "2019-10-15T12:00:00" } },
{ "_index" : "index24", "_type" : "_doc", "_id" : "bYCt0m0BD5PlkoxXZO2x", "_score" : 1.0, "_source" : { "trxID" : 2, "status" : "PNDG", "date" : "2019-10-15T12:01:00" } },
{ "_index" : "index24", "_type" : "_doc", "_id" : "boCt0m0BD5PlkoxXju1H", "_score" : 1.0, "_source" : { "trxID" : 3, "status" : "RCVD", "date" : "2019-10-15T12:00:00" } },
{ "_index" : "index24", "_type" : "_doc", "_id" : "b4Ct0m0BD5PlkoxXou0-", "_score" : 1.0, "_source" : { "trxID" : 3, "status" : "SENT", "date" : "2019-10-15T12:01:00" } }
GET index24/_search
{
  "size": 0,
  "aggs": {
    "transactionId": {
      "terms": { "field": "trxID", "size": 10000 },
      "aggs": {
        "maxDate": {
          "max": { "field": "date" }   // get max date under the transaction
        },
        "pending_status": {
          "filter": {
            "term": { "status.keyword": "PNDG" }   // filter for pending
          },
          "aggs": {
            "filtered_maxdate": {
              "max": { "field": "date" }   // get max date under pending
            }
          }
        },
        "buckets_latest_status_pending": {   // keep bucket if max date == pending date
          "bucket_selector": {
            "buckets_path": {
              "filtereddate": "pending_status>filtered_maxdate",
              "maxDate": "maxDate"
            },
            "script": "params.filtereddate==params.maxDate"
          }
        }
      }
    }
  }
}
"transactionId" : {
  "doc_count_error_upper_bound" : 0,
  "sum_other_doc_count" : 0,
  "buckets" : [
    {
      "key" : 2,   // only transaction id 2 is returned
      "doc_count" : 2,
      "pending_status" : {
        "doc_count" : 1,
        "filtered_maxdate" : { "value" : 1.57114086E12, "value_as_string" : "2019-10-15T12:01:00.000Z" }
      },
      "maxDate" : { "value" : 1.57114086E12, "value_as_string" : "2019-10-15T12:01:00.000Z" }
    }
  ]
}
I did an aggregation on trxID, and then I did sub aggregation on status.
That's a great start !!!
Now, you can leverage the bucket_selector pipeline aggregation in order to surface only the transactions which have only 1 or 2 documents, i.e. the script condition params.eventCount < 3 makes sure to catch all buckets that have RCVD and/or PNDG documents but no SENT documents:
POST events/_search
{
  "size": 0,
  "aggs": {
    "trx": {
      "terms": { "field": "trxID", "size": 1000 },
      "aggs": {
        "count": {
          "cardinality": { "field": "status.keyword" }
        },
        "not_sent": {
          "bucket_selector": {
            "buckets_path": { "eventCount": "count" },
            "script": "params.eventCount < 3"
          }
        }
      }
    }
  }
}
In your case, this would yield this, i.e. only the events with trxID = 3:
"aggregations" : {
  "trx" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 3,
        "doc_count" : 2,
        "count" : { "value" : 2 }
      }
    ]
  }
}

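One caveat with counting distinct statuses: in the seven sample events from the question, trxID 1 (RCVD, SENT) and trxID 3 (RCVD, PNDG) both have two distinct statuses, so a count by itself cannot tell them apart. A hedged alternative is to test for the presence of PNDG and the absence of SENT explicitly. The Python sketch below builds such a request body with two filter sub-aggregations and a bucket_selector on their `_count`, and includes a plain-Python reference of the same rule. The function and aggregation names are hypothetical, and `status.keyword` is assumed to exist as in the answers above.

```python
def build_pending_not_sent_query(status_field="status.keyword", size=1000):
    """Request body: transactions with at least one PNDG event and no SENT event."""
    return {
        "size": 0,
        "aggs": {
            "trx": {
                "terms": {"field": "trxID", "size": size},
                "aggs": {
                    "pending": {"filter": {"term": {status_field: "PNDG"}}},
                    "sent": {"filter": {"term": {status_field: "SENT"}}},
                    "pending_not_sent": {
                        "bucket_selector": {
                            "buckets_path": {
                                "pending": "pending>_count",
                                "sent": "sent>_count",
                            },
                            "script": "params.pending > 0 && params.sent == 0",
                        }
                    },
                },
            }
        },
    }


def pending_not_sent(events):
    """Plain-Python reference of the same rule, for checking results."""
    statuses = {}
    for event in events:
        statuses.setdefault(event["trxID"], set()).add(event["status"])
    return sorted(t for t, seen in statuses.items() if "PNDG" in seen and "SENT" not in seen)
```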
elasticsearch groupby and filter by regex condition

It's a bit hard for me to define the question as I'm not very experienced with Elasticsearch. I'm focusing the question on my specific problem:
Assuming I have the following records:
{ id: 1, name: bla1_1.aaa }
{ id: 1, name: bla1_2.bbb }
{ id: 2, name: bla2_1.aaa }
{ id: 2, name: bla2_2.aaa }
What I want is to GET all the ids that have all of their names ending with aaa.
I was thinking about group by id and then do a regex query like so: *\.aaa so that all the name must satisfy the regex query.
On this particular example I would get id: 2 back.
How do I do it?
Let me know if there's anything I need to add to clarify the question.
A regexp query can be used.
The pattern .* matches any character any number of times, including zero.
A terms aggregation will give you the unique ids and the number of docs under each.
Mapping :
PUT regex
"mappings": {
"properties": {
"fields": {
"hits" : [
  { "_index" : "regex", "_type" : "_doc", "_id" : "olQXjW0BywGFQhV7k84P", "_score" : 1.0, "_source" : { "id" : 1, "name" : "bla1_1.aaa" } },
  { "_index" : "regex", "_type" : "_doc", "_id" : "o1QXjW0BywGFQhV7us6B", "_score" : 1.0, "_source" : { "id" : 1, "name" : "bla1_2.bbb" } },
  { "_index" : "regex", "_type" : "_doc", "_id" : "pFQXjW0BywGFQhV77c6J", "_score" : 1.0, "_source" : { "id" : 2, "name" : "bla2_1.aaa" } },
  { "_index" : "regex", "_type" : "_doc", "_id" : "pVQYjW0BywGFQhV7Dc6F", "_score" : 1.0, "_source" : { "id" : 2, "name" : "bla2_2.aaa" } }
]
GET regex/_search
{
  "query": {
    "regexp": {
      "name.keyword": {
        "value": ".*\\.aaa"   // name ending with .aaa (the dot is escaped so it matches a literal ".")
      }
    }
  },
  "aggs": {
    "unique_ids": {
      "terms": { "field": "id", "size": 10 }
    }
  }
}
"hits" : {
  "total" : { "value" : 3, "relation" : "eq" },
  "max_score" : null,
  "hits" : [ ]
},
"aggregations" : {
  "unique_ids" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      { "key" : 2, "doc_count" : 2 },   // 2 docs under id 2
      { "key" : 1, "doc_count" : 1 }    // 1 doc under id 1
    ]
  }
}
Use a bucket selector to keep only the buckets where the total number of docs for the id matches the number of docs selected by the regex:
GET regex/_search
{
  "size": 0,
  "aggs": {
    "unique_ids": {
      "terms": { "field": "id", "size": 10 },
      "aggs": {
        "totalCount": {   // total count of docs for the id (all docs)
          "value_count": { "field": "id" }
        },
        "filter_agg": {
          "filter": {
            "bool": {
              "must": [
                { "regexp": { "name.keyword": ".*\\.aaa" } }
              ]
            }
          },
          "aggs": {
            "finalCount": {   // count of docs matching the regex
              "value_count": { "field": "id" }
            }
          }
        },
        "mybucket_selector": {   // keep buckets where totalCount == finalCount
          "bucket_selector": {
            "buckets_path": {
              "FinalCount": "filter_agg>finalCount",
              "TotalCount": "totalCount"
            },
            "script": "params.FinalCount==params.TotalCount"
          }
        }
      }
    }
  }
}

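The "total count equals matching count" check from the bucket-selector query can be illustrated in plain Python as below. This is a sketch of the logic only, not an Elasticsearch call; the function name `ids_where_all_names_match` is hypothetical. It uses `re.fullmatch` because an Elasticsearch regexp pattern has to match the whole term, and it escapes the dot so that only a literal ".aaa" suffix matches.

```python
import re
from collections import defaultdict


def ids_where_all_names_match(docs, pattern=r".*\.aaa"):
    """Return the ids for which every document's name matches the pattern."""
    regex = re.compile(pattern)
    total = defaultdict(int)     # docs per id ("totalCount")
    matching = defaultdict(int)  # docs per id matching the regex ("finalCount")
    for doc in docs:
        total[doc["id"]] += 1
        if regex.fullmatch(doc["name"]):
            matching[doc["id"]] += 1
    # keep an id only when both counts agree, as the bucket_selector script does
    return sorted(key for key in total if matching[key] == total[key])
```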