Get an aggregate count in Elasticsearch based on a particular unique ID field

I have created an index and indexed documents in Elasticsearch, and it's working fine. The challenge is that I have to get an aggregate count of the Category field grouped by the unique UserID. I have given my sample documents below:
{
"UserID":"A1001",
"Category":"initiated",
"policyno":"5221"
},
{
"UserID":"A1001",
"Category":"pending",
"policyno":"5222"
},
{
"UserID":"A1001",
"Category":"pending",
"policyno":"5223"
},
{
"UserID":"A1002",
"Category":"completed",
"policyno":"5224"
}
**Sample output for UserID - "A1001"**
initiated-1
pending-2
**Sample output for UserID - "A1002"**
completed-1
How can I get the aggregate counts from the JSON documents above, in the form of the sample output shown?

I suggest a nested terms aggregation, as shown in the following:
{
"size": 0,
"aggs": {
"By_ID": {
"terms": {
"field": "UserID.keyword"
},
"aggs": {
"By_Category": {
"terms": {
"field": "Category.keyword"
}
}
}
}
}
}
Here is a snippet of the response:
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"By_ID" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A1001",
"doc_count" : 3,
"By_Category" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "pending",
"doc_count" : 2
},
{
"key" : "initiated",
"doc_count" : 1
}
]
}
},
{
"key" : "A1002",
"doc_count" : 1,
"By_Category" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "completed",
"doc_count" : 1
}
]
}
}
]
}
}
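If you only need the breakdown for a single user, as in the sample output for A1001, you can combine the same aggregation with a term query so that only that user's documents are counted. A minimal sketch, assuming the same UserID.keyword and Category.keyword sub-fields (replace <your_index_name> with your index):
GET <your_index_name>/_search
{
  "size": 0,
  "query": {
    "term": {
      "UserID.keyword": "A1001"     <---- restricts the aggregation to one user
    }
  },
  "aggs": {
    "By_Category": {
      "terms": {
        "field": "Category.keyword"
      }
    }
  }
}
The By_Category buckets then map directly to the initiated-1 / pending-2 style output for that user.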

Related

How to get word count in docs as an aggregate over time in Elasticsearch?

I am trying to get word-count trends across documents as an aggregate result. Using the following approach I am able to get the doc-count aggregation result, but I am not able to find any resources that explain how to get the word count for the months of jan, feb and mar.
PUT test/_doc/1
{
"description" : "one two three four",
"month" : "jan"
}
PUT test/_doc/2
{
"description" : "one one test test test",
"month" : "feb"
}
PUT test/_doc/3
{
"description" : "one one one test",
"month" : "mar"
}
GET test/_search
{
"size": 0,
"query": {
"match": {
"description": {
"query": "one"
}
}
},
"aggs": {
"monthly_count": {
"terms": {
"field": "month.keyword"
}
}
}
}
OUTPUT
{
"took" : 706,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"monthly_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "feb",
"doc_count" : 1
},
{
"key" : "jan",
"doc_count" : 1
},
{
"key" : "mar",
"doc_count" : 1
}
]
}
}
}
EXPECTED WORD COUNT OVER MONTH
"aggregations" : {
"monthly_count" : {
"buckets" : [
{
"key" : "feb",
"word_count" : 2
},
{
"key" : "jan",
"word_count" : 1
},
{
"key" : "mar",
"word_count" : 3
}
]
}
}
Maybe this query can help you:
GET test/_search
{
"size": 0,
"aggs": {
"monthly_count": {
"terms": {
"field": "month.keyword"
},
"aggs": {
"count_word_one": {
"terms": {
"script": {
"source": """
def str = doc['description.keyword'].value;
def array = str.splitOnToken(' ');
int i = 0;
for (item in array) {
if(item == 'one'){
i++
}
}
return i;
"""
},
"size": 10
}
}
}
}
}
}
Response:
"aggregations" : {
"monthly_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "feb",
"doc_count" : 1,
"count_word_one" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2",
"doc_count" : 1
}
]
}
},
{
"key" : "jan",
"doc_count" : 1,
"count_word_one" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1
}
]
}
},
{
"key" : "mar",
"doc_count" : 1,
"count_word_one" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 1
}
]
}
}
]
}
}
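The script above returns the per-document count as a bucket key rather than as a number per month. If you want a numeric word_count directly under each month bucket, closer to the expected output, one option is a scripted sum sub-aggregation. This is only a sketch, assuming description has a .keyword sub-field as in the answer above:
GET test/_search
{
  "size": 0,
  "aggs": {
    "monthly_count": {
      "terms": {
        "field": "month.keyword"
      },
      "aggs": {
        "word_count": {
          "sum": {
            "script": {
              "source": """
              // count how many times 'one' occurs in each document's description
              int i = 0;
              for (item in doc['description.keyword'].value.splitOnToken(' ')) {
                if (item == 'one') { i++; }
              }
              return i;
              """
            }
          }
        }
      }
    }
  }
}
Each month bucket should then carry a numeric word_count value (for example 2 for feb) instead of a nested bucket keyed by the count.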

Aggregate by custom defined buckets, according to field value

I'm interested in aggregating my data into buckets, but I want to put two distinct values to the same bucket.
This is what I mean:
Say I have this query:
GET _search
{
"size": 0,
"aggs": {
"my-agg-name": {
"terms": {
"field": "ecs.version"
}
}
}
}
it returns this response:
"aggregations" : {
"my-agg-name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1.12.0",
"doc_count" : 642826144
},
{
"key" : "8.0.0",
"doc_count" : 204064845
},
{
"key" : "1.1.0",
"doc_count" : 16508253
},
{
"key" : "1.0.0",
"doc_count" : 9162928
},
{
"key" : "1.6.0",
"doc_count" : 1111542
},
{
"key" : "1.5.0",
"doc_count" : 10445
}
]
}
}
Every distinct value of the field ecs.version is in its own bucket.
But say I wanted to define my buckets such that:
bucket1: [1.12.0, 8.0.0]
bucket2: [1.6.0, 8.4.0]
bucket3: [1.0.0, 8.8.0]
Is this possible in anyway?
I know I can just return all the buckets and do the sums programmatically, but this list can be very long and I don't think that would be efficient. Am I wrong?
You can use a runtime mapping to generate a runtime field and use that field for the aggregation. I have done the example below on ES 7.16.
I have indexed some sample documents, and below is the aggregation output without joining multiple values into one bucket:
"aggregations" : {
"version" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1.12.0",
"doc_count" : 3
},
{
"key" : "1.6.0",
"doc_count" : 3
},
{
"key" : "8.4.0",
"doc_count" : 3
},
{
"key" : "8.0.0",
"doc_count" : 2
}
]
}
}
You can use the query below with a runtime mapping, but you need to add multiple if conditions for your version groupings:
{
"size": 0,
"runtime_mappings": {
"normalized_version": {
"type": "keyword",
"script": """
String version = doc['version.keyword'].value;
if (version.equals('1.12.0') || version.equals('8.0.0')) {
emit('1.12.0, 8.0.0');
} else if (version.equals('1.6.0') || version.equals('8.4.0')){
emit('1.6.0, 8.4.0');
}else {
emit(version);
}
"""
}
},
"aggs": {
"genres": {
"terms": {
"field": "normalized_version"
}
}
}
}
Below is the output of the above aggregation query:
"aggregations" : {
"genres" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1.6.0, 8.4.0",
"doc_count" : 6
},
{
"key" : "1.12.0, 8.0.0",
"doc_count" : 5
}
]
}
}
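If the version groups are fixed and known up front, another option that avoids scripting entirely is a filters aggregation, where each named bucket is a terms query over the versions it should contain. A sketch assuming ecs.version is a keyword field as in your original query; the bucket names and the 1.0.0/8.8.0 group are taken from your example:
GET <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "version_groups": {
      "filters": {
        "filters": {
          "bucket1": { "terms": { "ecs.version": [ "1.12.0", "8.0.0" ] } },
          "bucket2": { "terms": { "ecs.version": [ "1.6.0", "8.4.0" ] } },
          "bucket3": { "terms": { "ecs.version": [ "1.0.0", "8.8.0" ] } }
        }
      }
    }
  }
}
Each named filter becomes one bucket in the response with its own doc_count, so the summing happens server-side just as with the runtime-field approach.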

Elasticsearch, sort aggs according to sibling fields but from a different index

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{
'_id' : 1001,
'title' : "Title 1",
'smp_id' : 1,
},
{
'_id' : 1002,
'title' : "Title 2",
'smp_id' : 2,
},
{
'_id' : 1003,
'title' : "Title 3",
'smp_id' : 3,
}
//omitted other documents
influencers
{
'_id' : 1,
'name' : "John",
'smp_id' : 1,
'smp_score' : 5
},
{
'_id' : 2,
'name' : "Peter",
'smp_id' : 2,
'smp_score' : 10
},
{
'_id' : 3,
'name' : "Mark",
'smp_id' : 3,
'smp_score' : 15
}
//omitted other documents
Now I have this simple query that determines which influencer has the most documents in the socialmedia index:
GET socialmedia/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"INFLUENCERS": {
"terms": {
"field": "smp_id.keyword"
//smp_id is a **text** based field, that's why we have `.keyword` here
}
}
}
}
SAMPLE OUTPUT:
"aggregations" : {
"INFLUENCERS" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
{
"key" : "1",
"doc_count" : 87258
},
{
"key" : "2",
"doc_count" : 36518
},
{
"key" : "3",
"doc_count" : 34838
},
]
}
}
OBJECTIVE:
My query is able to sort the influencers according to the doc_count of their posts in the socialmedia index. Now, is there a way for us to sort the INFLUENCERS aggregation, or otherwise order the influencers, according to their smp_score?
With that idea, smp_id 3, which is Mark, should be the first one to appear, since he has an smp_score of 15.
Thank you in advance for your help!
What you are looking for is a JOIN operation. Note that Elasticsearch doesn't support JOIN operations unless the data is modelled in one of the ways mentioned in this link.
Instead, a very simplistic approach is to denormalize your data and add the smp_score to your socialmedia index as below:
Mapping:
PUT socialmedia
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"smp_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"smp_score": {
"type": "float"
}
}
}
}
Your ES query would then have two terms aggregations, as shown below:
Request Query:
POST socialmedia/_search
{
"size": 0,
"aggs": {
"influencers_score_agg": {
"terms": {
"field": "smp_score",
"order": { "_key": "desc" }
},
"aggs": {
"influencers_id_agg": {
"terms": {
"field": "smp_id.keyword"
}
}
}
}
}
}
Basically we are first aggregating on the smp_score and then introducing a sub-aggregation to display the smp_id.
Response:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_influencers_score" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 15.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 1
}
]
}
},
{
"key" : 10.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2",
"doc_count" : 1
}
]
}
},
{
"key" : 5.0,
"doc_count" : 1,
"influencers" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1
}
]
}
}
]
}
}
}
Do spend some time reading the above link; however, that would require you to model your index in a different way depending on the options mentioned in it. From what I understand, the solution I've provided should suffice.
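As a variation on the denormalized index above, you could also keep a single terms aggregation on smp_id and order it by a max metric sub-aggregation on smp_score, which gives one bucket per influencer already sorted by score. A sketch under the same denormalization assumption (the max_smp_score name is only illustrative):
POST socialmedia/_search
{
  "size": 0,
  "aggs": {
    "INFLUENCERS": {
      "terms": {
        "field": "smp_id.keyword",
        "order": { "max_smp_score": "desc" }     <---- order buckets by the metric below
      },
      "aggs": {
        "max_smp_score": {
          "max": { "field": "smp_score" }
        }
      }
    }
  }
}
With the sample data this should put smp_id 3 (Mark, score 15) first, while each bucket still carries the per-influencer doc_count.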

About elasticsearch group by two fields and then filter or order

There is a shareholder index, and I want to get the info below.
Which holder invested in the same company the most times?
select hld_id, com_id, count(*) from shareholder group by hld_id, com_id order by count(*) desc;
Which holder invested in a company exactly two times (there may be duplicate records)?
select hld_id, com_id from shareholder group by hld_id, com_id having count(*) = 2;
So how can I implement the above requirements with Elasticsearch search queries?
Below are a sample mapping, documents, and aggregation queries. I've worked out three possible ways this can be done.
Mapping:
PUT shareholder
{
"mappings": {
"properties": {
"hld_id": {
"type": "keyword"
},
"com_id":{
"type": "keyword"
}
}
}
}
Documents:
POST shareholder/_doc/1
{
"hld_id": "001",
"com_id": "001"
}
POST shareholder/_doc/2
{
"hld_id": "001",
"com_id": "002"
}
POST shareholder/_doc/3
{
"hld_id": "002",
"com_id": "001"
}
POST shareholder/_doc/4
{
"hld_id": "002",
"com_id": "002"
}
POST shareholder/_doc/5
{
"hld_id": "002",
"com_id": "002" <--- Note I've changed this
}
Solution 1: Using Elasticsearch's aggregation
Aggregation Query: 1
Note that I've just made use of two terms aggregations, first on hld_id and then nested on com_id.
POST shareholder/_search
{
"size": 0,
"aggs": {
"share_hoder": {
"terms": {
"field": "hld_id"
},
"aggs": {
"com_aggs": {
"terms": {
"field": "com_id",
"order": {
"_count": "desc"
}
}
}
}
}
}
}
Below is how the response appears:
Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"share_hoder" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "002",
"doc_count" : 3,
"com_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "002",
"doc_count" : 2 <---- Count you are looking for
},
{
"key" : "001",
"doc_count" : 1
}
]
}
},
{
"key" : "001",
"doc_count" : 2,
"com_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "001",
"doc_count" : 1
},
{
"key" : "002",
"doc_count" : 1
}
]
}
}
]
}
}
}
Of course you may not get the representation of the result exactly as you are looking for because of the way Elasticsearch's aggregation works.
Aggregation Query: 2
For this, most of it is the same as Aggregation Query 1, where I've used two terms aggregations, but I've additionally made use of a Bucket Selector aggregation in which I've added the condition for count(*) == 2
POST shareholder/_search
{
"size": 0,
"aggs": {
"share_holder": {
"terms": {
"field": "hld_id",
"order": {
"_key": "desc"
}
},
"aggs": {
"com_aggs": {
"terms": {
"field": "com_id"
},
"aggs": {
"count_filter":{
"bucket_selector": {
"buckets_path": {
"count_path": "_count"
},
"script": "params.count_path == 2"
}
}
}
}
}
}
}
}
Below is how the response appears.
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"share_holder" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "002",
"doc_count" : 3,
"com_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "002",
"doc_count" : 2 <---- Count == 2
}
]
}
},
{
"key" : "001",
"doc_count" : 2,
"com_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
}
Note that the second bucket is empty. I'm trying to see if I can filter the above query so that the "key" : "001" bucket doesn't appear in the first place.
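One way that may work for dropping those empty parent buckets is a second bucket_selector at the share_holder level, pointing its buckets_path at the _bucket_count of the inner aggregation. A sketch of the idea (the drop_empty name is only illustrative):
POST shareholder/_search
{
  "size": 0,
  "aggs": {
    "share_holder": {
      "terms": {
        "field": "hld_id",
        "order": { "_key": "desc" }
      },
      "aggs": {
        "com_aggs": {
          "terms": { "field": "com_id" },
          "aggs": {
            "count_filter": {
              "bucket_selector": {
                "buckets_path": { "count_path": "_count" },
                "script": "params.count_path == 2"
              }
            }
          }
        },
        "drop_empty": {
          "bucket_selector": {
            "buckets_path": { "non_empty": "com_aggs._bucket_count" },
            "script": "params.non_empty > 0"
          }
        }
      }
    }
  }
}
The outer bucket_selector runs once per share_holder bucket, so holders whose com_aggs buckets were all removed by the inner filter should disappear from the response.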
Solution 2: Using Elasticsearch SQL:
If you have x-pack available, you can execute the queries below in SQL style:
Query:1
POST /_sql?format=txt
{
"query": "SELECT hld_id, com_id, count(*) FROM shareholder GROUP BY hld_id, com_id ORDER BY count(*) desc"
}
Response:
hld_id | com_id | count(*)
---------------+---------------+---------------
002 |002 |2
001 |001 |1
001 |002 |1
002 |001 |1
Query 2:
POST /_sql?format=txt
{
"query": "SELECT hld_id, com_id FROM shareholder GROUP BY hld_id, com_id HAVING count(*) = 2"
}
Response:
hld_id | com_id
---------------+---------------
002 |002
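If you'd rather consume the result programmatically instead of the aligned text table, the same endpoint accepts other values for the format parameter (for example json or csv); only the format changes:
POST /_sql?format=json
{
  "query": "SELECT hld_id, com_id FROM shareholder GROUP BY hld_id, com_id HAVING count(*) = 2"
}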
Solution 3: Using Script in Terms Aggregation
Aggregation Query:
POST shareholder/_search
{
"size": 0,
"aggs": {
"query_groupby_count": {
"terms": {
"script": {
"source": """
doc['hld_id'].value + ", " + doc['com_id'].value
"""
}
}
},
"query_groupby_count_equals_2": {
"terms": {
"script": {
"source": """
doc['hld_id'].value + ", " + doc['com_id'].value
"""
}
},
"aggs": {
"myaggs": {
"bucket_selector": {
"buckets_path": {
"count": "_count"
},
"script": "params.count == 2"
}
}
}
}
}
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"query_groupby_count_equals_2" : { <---- Group By Query For Count == 2
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "002, 002",
"doc_count" : 2
}
]
},
"query_groupby_count" : { <---- Group By Query
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "002, 002",
"doc_count" : 2
},
{
"key" : "001, 001",
"doc_count" : 1
},
{
"key" : "001, 002",
"doc_count" : 1
},
{
"key" : "002, 001",
"doc_count" : 1
}
]
}
}
}
Using CURL:
First, let us save the query in a .txt or .json file.
For example, I created a file called query.json and pasted only the query into that file.
{
"query": "SELECT hld_id, com_id, count(*) FROM shareholder GROUP BY hld_id, com_id ORDER BY count(*) desc"
}
Now execute the curl command below, which refers to that file:
curl -XGET "http://localhost:9200/_sql?format=txt" -H "Content-Type: application/json" -d @query.json
Hope this helps!

Is it possible with aggregation to amalgamate all values of an array property from all grouped documents into the coalesced document?

I have documents with the format similar to the following:
[
{
"name": "fred",
"title": "engineer",
"division_id": 20
"skills": [
"walking",
"talking"
]
},
{
"name": "ed",
"title": "ticket-taker",
"division_id": 20
"skills": [
"smiling"
]
}
]
I would like to run an aggs query that would show the complete set of skills for the division, i.e.:
{
"aggs":{
"distinct_skills":{
"cardinality":{
"field":"division_id"
}
}
},
"_source":{
"includes":[
"division_id",
"skills"
]
}
}
... so that the resulting hit would look like:
{
"division_id": 20,
"skills": [
"walking",
"talking",
"smiling"
]
}
I know I can retrieve inner_hits and iterate through the list and amalgamate the values "manually". I assume it would perform better if I could do it in a query.
Just nest two terms aggregations as shown below:
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"my_division_ids": {
"terms": {
"field": "division_id",
"size": 10
},
"aggs": {
"my_skills": {
"terms": {
"field": "skills", <---- If it is not keyword field use `skills.keyword` field if using dynamic mapping.
"size": 10
}
}
}
}
}
}
Below is the sample response:
Response:
{
"took" : 490,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_division_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 20, <---- division_id
"doc_count" : 2,
"my_skills" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ <---- Skills
{
"key" : "smiling",
"doc_count" : 1
},
{
"key" : "talking",
"doc_count" : 1
},
{
"key" : "walking",
"doc_count" : 1
}
]
}
}
]
}
}
}
Hope this helps!
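One caveat with this approach: a terms aggregation only returns the top size buckets (10 by default), so if a division can have many distinct skills you may want to raise size on the inner aggregation. A variant of the query above, assuming a skills.keyword sub-field and an arbitrarily large size of 1000:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "my_division_ids": {
      "terms": {
        "field": "division_id",
        "size": 10
      },
      "aggs": {
        "my_skills": {
          "terms": {
            "field": "skills.keyword",
            "size": 1000     <---- large enough to cover all distinct skills per division
          }
        }
      }
    }
  }
}
The doc_count on each skill bucket also tells you how many documents in the division contain that skill.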
