Grouping consecutive documents with Elasticsearch

Is there a way to make Elasticsearch consider sequence-gaps when grouping?
Provided that the following data was bulk-imported to Elasticsearch:
{ "index": { "_index": "test", "_type": "groupingTest", "_id": "1" } }
{ "sequence": 1, "type": "A" }
{ "index": { "_index": "test", "_type": "groupingTest", "_id": "2" } }
{ "sequence": 2, "type": "A" }
{ "index": { "_index": "test", "_type": "groupingTest", "_id": "3" } }
{ "sequence": 3, "type": "B" }
{ "index": { "_index": "test", "_type": "groupingTest", "_id": "4" } }
{ "sequence": 4, "type": "A" }
{ "index": { "_index": "test", "_type": "groupingTest", "_id": "5" } }
{ "sequence": 5, "type": "A" }
Is there a way to query this data in a way that
the documents with sequence number 1 and 2 go to one output group,
the document with sequence number 3 goes to another one, and
the documents with sequence number 4 and 5 go to a third group?
... considering the fact that the type A sequence is interrupted by a type B item (or any other item that's not type A)?
I would like the result buckets to look something like this (the name and value of sequence_group may be different; I am just trying to illustrate the logic):
"buckets": [
{
"key": "a",
"sequence_group": 1,
"doc_count": 2
},
{
"key": "b",
"sequence_group": 3,
"doc_count": 1
},
{
"key": "a",
"sequence_group": 4,
"doc_count": 2
}
]
There is a good description of the problem and some SQL solution approaches at https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/. I would like to know if there is a solution available for Elasticsearch as well.

We can use a scripted metric aggregation here, which works in a map-reduce fashion (see the scripted metric aggregation docs). It has different parts: init, map, combine, and reduce. The good thing is that the result of each of these can be a list or a map.
I played around with this a bit.
Elasticsearch version used: 7.1
Creating index:
PUT test
{
"mappings": {
"properties": {
"sequence": {
"type": "long"
},
"type": {
"type": "text",
"fielddata": true
}
}
}
}
Bulk indexing (note that I removed the mapping type 'groupingTest', since mapping types are deprecated in 7.x):
POST _bulk
{ "index": { "_index": "test", "_id": "1" } }
{ "sequence": 1, "type": "A" }
{ "index": { "_index": "test", "_id": "2" } }
{ "sequence": 2, "type": "A" }
{ "index": { "_index": "test", "_id": "3" } }
{ "sequence": 3, "type": "B" }
{ "index": { "_index": "test", "_id": "4" } }
{ "sequence": 4, "type": "A" }
{ "index": { "_index": "test", "_id": "5" } }
{ "sequence": 5, "type": "A" }
Query
GET test/_search
{
"size": 0,
"aggs": {
"scripted_agg": {
"scripted_metric": {
"init_script": """
state.seqTypeArr = [];
""",
"map_script": """
def seqType = doc.sequence.value + '_' + doc['type'].value;
state.seqTypeArr.add(seqType);
""",
"combine_script": """
def list = [];
for(seqType in state.seqTypeArr) {
list.add(seqType);
}
return list;
""",
"reduce_script": """
def fullList = [];
for(agg_value in states) {
for(x in agg_value) {
fullList.add(x);
}
}
// sort numerically by the sequence prefix so that e.g. "10_A" sorts after "9_A"
fullList.sort((a, b) -> Long.compare(Long.parseLong(a.substring(0, a.indexOf("_"))), Long.parseLong(b.substring(0, b.indexOf("_")))));
def result = [];
def item = new HashMap();
for(int i=0; i<fullList.size(); i++) {
def str = fullList.get(i);
def index = str.indexOf("_");
def ch = str.substring(index+1);
def val = str.substring(0, index);
if(item["key"] == null) {
item["key"] = ch;
item["sequence_group"] = val;
item["doc_count"] = 1;
} else if(item["key"] == ch) {
item["doc_count"] = item["doc_count"] + 1;
} else {
result.add(item);
item = new HashMap();
item["key"] = ch;
item["sequence_group"] = val;
item["doc_count"] = 1;
}
}
result.add(item);
return result;
"""
}
}
}
}
And, finally the output:
{
"took" : 21,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"scripted_agg" : {
"value" : [
{
"doc_count" : 2,
"sequence_group" : "1",
"key" : "a"
},
{
"doc_count" : 1,
"sequence_group" : "3",
"key" : "b"
},
{
"doc_count" : 2,
"sequence_group" : "4",
"key" : "a"
}
]
}
}
}
Please note that a scripted metric aggregation has a significant impact on query performance, so you might notice slowness if there is a large number of documents.

You can always do a terms aggregation and then apply a top_hits aggregation to get this.
{
"aggs": {
"types": {
"terms": {
"field": "type"
},
"aggs": {
"groups": {
"top_hits": {
"size": 10
}
}
}
}
}
}
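If you want the documents inside each type bucket ordered by sequence, top_hits also accepts a sort option; a small sketch extending the aggregation above (not part of the original answer):
{
  "aggs": {
    "types": {
      "terms": { "field": "type" },
      "aggs": {
        "groups": {
          "top_hits": {
            "size": 10,
            "sort": [ { "sequence": { "order": "asc" } } ]
          }
        }
      }
    }
  }
}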

Related

Aggregation on Latest Records Of same status in ElasticSearch

I Have following data in ElasticSearch index some_index.
[ {
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 1,
"cart_status": "new",
"grandTotal": 12,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 1,
"cart_status": "paid",
"grandTotal": 12,
"event": "some_event",
"timestamp": "2022-12-02T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 2,
"cart_status": "new",
"grandTotal": 23,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 2,
"cart_status": "paid",
"grandTotal": 23,
"event": "some_event",
"timestamp": "2022-12-04T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 3,
"cart_status": "new",
"grandTotal": 17,
"event": "some_event",
"timestamp": "2022-12-01T00:00:00.000Z"
}
}
},
{
"_index": "some_index",
"_source": {
"cart": {
"cart_id": 3,
"cart_status": "new",
"grandTotal": 17,
"event": "some_event",
"timestamp": "2022-12-04T00:00:00.000Z"
}
}
}
]
What I want to get is the sum of grandTotal by the latest cart_status of each cart within a given time range.
Given the example above, the result for timestamp >= 2022-12-01 00:00:00 and timestamp <= 2022-12-03 00:00:00 should be something like:
cart_status: new, sum grandTotal: 40, because within that time range the latest status "new" belongs to cart_id 2 and 3,
and cart_status: paid, sum grandTotal: 12, because "paid" is the latest status of cart_id 1 only.
What I tried is to use a sub-aggregation under a top_hits aggregation (top_result), but Elasticsearch complains that "Aggregator [top_result] of type [top_hits] cannot accept sub-aggregations".
I also tried collapse to get the latest document per status, but according to the docs it is not possible to aggregate over the results of a collapse either.
Can someone please help me solve this? It seems like a common calculation, but it is not trivial in Elasticsearch.
In SQL this is quite easy with window functions.
I want to avoid persisting intermediate data into another index, because the query needs to be dynamic; users may want to run the calculation for any time range.
You can try the following way. Note that the sum for cart_status "new" will be 52 rather than 40, because the range filter also includes cart_id 1, which has a "new" document within the given time range along with carts 2 and 3.
Mappings:
PUT some_index
{
"mappings" : {
"properties": {
"timestamp" : {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||strict_date_optional_time ||epoch_millis"
},
"cart_id" : {
"type": "keyword"
},
"cart_status" : {
"type": "keyword"
},
"grand_total" : {
"type": "long"
},
"event":{
"type": "keyword"
}
}
}
}
Bulk Insert:
POST _bulk
{ "index" : { "_index" : "some_index", "_id" : "1" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "2" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "paid","timestamp":"2022-12-02T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "3" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "4" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "paid","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "5" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "6" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
Query:
GET some_index/_search
{
"size":0,
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "2022-12-01 00:00:00",
"lte": "2022-12-03 00:00:00"
}
}
}
]
}
},
"aggs": {
"card_status": {
"terms": {
"field": "cart_status"
},
"aggs": {
"grandTotal": {
"sum": {
"field": "grand_total"
}
}
}
}
}
}
Output:
{
"took": 86,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"card_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new",
"doc_count": 3,
"grandTotal": {
"value": 52
}
},
{
"key": "paid",
"doc_count": 1,
"grandTotal": {
"value": 12
}
}
]
}
}
}

Count the number of elements in a list field in Elastic Search

I'm still learning to use DSL queries in ElasticSearch. I have documents where one field is a list. I need to count the number of documents that have one element in this field, two elements in this field, etc. For example, here is a document structure:
Document1:
"Volume": [
{
"partition": "s1",
"fieldtype": ["A","B"]
}
]
Document 2:
"Volume": [
{
"partition": "s1",
"fieldtype": ["A"]
}
]
Document 3:
"Volume": [
{
"partition": "s1",
"fieldtype": ["B"]
}
]
I need a way to calculate that there is one document with 2 elements in fieldtype field and 2 documents with one element in fieldtype.
If I try to aggregate them like this:
"size":0,
"aggs": {
"name": {
"terms": {
"field": "fieldtype.keyword"
}
}
}
I get counts of elements (number of As and Bs). Without using keyword, I get an error.
@rabbitbr provided a good answer, but I am not sure why a nested field is needed, and I think a terms aggregation fits better than a sum here. Anyhow, here is a solution without nested:
PUT idx_test
POST idx_test/_bulk
{"index":{ "_id": 1}}
{"Volume":[{"partition": "s1","fieldtype": ["A","B"]}]}
{"index":{ "_id": 2}}
{"Volume":[{"partition": "s1","fieldtype": ["A"]}]}
{"index":{ "_id": 3}}
{"Volume":[{"partition": "s1","fieldtype": ["B"]}]}
GET idx_test/_mapping
GET idx_test/_search
{
"size": 0,
"aggs": {
"size": {
"terms": {
"script": {
"lang": "painless",
"source": "doc['Volume.fieldtype.keyword'].size()"
}
}
}
}
}
Without using keyword, I get an error.
This is normal because without keyword you are trying to build an aggregation on a field whose type is text.
Here is the response for the query above, which is a pretty basic query:
{
....
"aggregations": {
"size": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 2
},
{
"key": "2",
"doc_count": 1
}
]
}
}
}
As you can see, we have 2 documents with a 1-element array and 1 document with a 2-element array.
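For context, the reason Volume.fieldtype.keyword works here is that dynamic mapping indexes string values as a text field with a keyword sub-field. The exact mapping of idx_test is not shown above, so the following is only a sketch of what a bare PUT idx_test followed by dynamic mapping would typically produce:
PUT idx_test
{
  "mappings": {
    "properties": {
      "Volume": {
        "properties": {
          "partition": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          },
          "fieldtype": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          }
        }
      }
    }
  }
}
Aggregating on Volume.fieldtype directly targets the text part, which has no fielddata by default, hence the error; Volume.fieldtype.keyword targets the keyword sub-field, which is aggregatable.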
I assumed that you are working with the nested type. Below is my solution:
PUT idx_test
{
"mappings": {
"properties": {
"Volume": {
"type": "nested"
}
}
}
}
POST idx_test/_bulk
{"index":{ "_id": 1}}
{"Volume":[{"partition": "s1","fieldtype": ["A","B"]}]}
{"index":{ "_id": 2}}
{"Volume":[{"partition": "s1","fieldtype": ["A"]}]}
{"index":{ "_id": 3}}
{"Volume":[{"partition": "s1","fieldtype": ["B"]}]}
GET idx_test/_search
{
"size": 0,
"aggs": {
"doc_id": {
"terms": {
"field": "_id",
"size": 10
},
"aggs": {
"volumes": {
"nested": {
"path": "Volume"
},
"aggs": {
"size": {
"sum": {
"script": {
"lang": "painless",
"source": "doc['Volume.fieldtype.keyword'].size()"
}
}
}
}
}
}
}
}
}
Response:
"aggregations" : {
"doc_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"volumes" : {
"doc_count" : 1,
"size" : {
"value" : 2.0
}
}
},
{
"key" : "2",
"doc_count" : 1,
"volumes" : {
"doc_count" : 1,
"size" : {
"value" : 1.0
}
}
},
{
"key" : "3",
"doc_count" : 1,
"volumes" : {
"doc_count" : 1,
"size" : {
"value" : 1.0
}
}
}
]
}
}

How to get sum of different fields / array values in Elasticsearch?

Using Elasticsearch 7.9.0
My document looks like this
{
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
}
I need one more field, total_marks, in the response of the GET API.
Something like this:
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"total_marks": 270
}
]
}
I tried using script_fields
My query is
GET sample/_search
{
"query": {
"match_all": {}
},
"script_fields": {
"total_marks": {
"script": {
"source": """double sum = 0.0;
for( item in params._source.student.marks)
{ sum = sum + item.sub }
return sum;"""
}
}
}
}
I got the response as:
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"fields": {
"total_marks": [
270
]
}
}
]
}
Is there any way to get the expected result?
Any better/optimal solution would help a lot.
Thank you.
A terms aggregation and a sum aggregation can be used to find the total marks per group:
{
"aggs": {
"students": {
"terms": {
"field": "student.id.keyword",
"size": 10
},
"aggs": {
"total_marks": {
"sum": {
"field": "student.marks.sub"
}
}
}
}
}
}
Result
"aggregations" : {
"students" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"total_marks" : {
"value" : 270.0
}
}
]
}
}
This will be faster than a script, but pagination is easier with a query than with an aggregation, so choose accordingly.
The best option may be to calculate the total at index time, if those fields do not change frequently.
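If you go the index-time route, one way to do it (a rough sketch, not part of the original answer; the pipeline name total_marks_pipeline and the target field total_marks are my own choices) is an ingest pipeline with a script processor that sums student.marks.sub into a top-level field as each document is indexed:
PUT _ingest/pipeline/total_marks_pipeline
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          // sum the sub values of all marks and store the result on the document
          double sum = 0;
          for (item in ctx.student.marks) {
            sum += item.sub;
          }
          ctx.total_marks = sum;
        """
      }
    }
  ]
}
POST sample/_doc?pipeline=total_marks_pipeline
{
  "student": { "marks": [ { "sub": 80 }, { "sub": 90 }, { "sub": 100 } ] }
}
Documents indexed through this pipeline carry total_marks as a regular field, so it can be returned, sorted, and aggregated without any script at query time.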

I want to get all entities from nested JSON data where "ai_id" has the value 0

I have the JSON data below, and I want to write a query in Elasticsearch that
gives me all entities where "ai_id" has the value 0.
The JSON data is:
{
"_index": "try1",
"_type": "_doc",
"_id": "2",
"_score": 1,
"_source": {
"target": {
"br_id": 0,
"an_id": 0,
"ai_id": 0,
"explanation": [
"element 1",
"element 2"
]
},
"process": {
"an_id": 1311,
"pa_name": "micha"
},
"text": "hello world"
}
},
{
"_index": "try1",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"target": {
"br_id": 0,
"an_id": 1,
"ai_id": 1,
"explanation": [
"element 3",
"element 4"
]
},
"process": {
"an_id": 1311,
"pa_name": "luca"
},
"text": "the all People are good"
}
}
]
}
}
I tried this, but it does not seem to work. Any help is appreciated.
GET try1/_search
{
"query":{
{ "match_all": { "ai_id": 0}}
}
}
and this did not work either:
GET try1/_search
{
"query": {
"nested" : {
"query" : {
"must" : [
{ "match" : {"ai_id" : 0} }
]
}
}
}
}
Any suggestion is appreciated.
Thanks.
You need to use a nested query on your target object, like this:
GET /try1/_search
{
"query": {
"nested" : {
"path" : "target",
"query" : {
"bool" : {
"must" : [
{ "match" : {"target.ai_id" : 0} }
]
}
}
}
}
}
Ref. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html
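Note that a nested query only works if target is actually mapped as the nested type; if target is a plain object field, a regular bool query with a match on target.ai_id is enough. A minimal sketch of a mapping that makes the nested query above valid (assuming the remaining fields are left to dynamic mapping):
PUT try1
{
  "mappings": {
    "properties": {
      "target": {
        "type": "nested"
      }
    }
  }
}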

Make a flat array from Elasticsearch query results

I have an index with the following documents (simplified):
{
"user" : "j.johnson",
"certifications" : [{
"certification_date" : "2013-02-09T00:00:00+03:00",
"previous_level" : "No Level",
"obtained_level" : "Junior"
}, {
"certification_date" : "2014-05-26T00:00:00+03:00",
"previous_level" : "Junior",
"obtained_level" : "Middle"
}
]
}
I just want to have a flat list of all certifications passed by all users where certification_date > 2014-01-01. It should be a pretty large array like this:
[{
"certification_date" : "2014-09-08T00:00:00+03:00",
"previous_level" : "No Level",
"obtained_level" : "Junior"
}, {
"certification_date" : "2014-05-26T00:00:00+03:00",
"previous_level" : "Junior",
"obtained_level" : "Middle"
}, {
"certification_date" : "2015-01-26T00:00:00+03:00",
"previous_level" : "Junior",
"obtained_level" : "Middle"
}
...
]
It doesn't seem to be a hard task, but I wasn't able to find an easy way to do it.
I would do it with a parent/child relationship, though you will have to reorganize your data. I don't think you can get what you want with your current schema.
More concretely, I set up an index like this, with user as parent and certification as child:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"user": {
"properties": {
"user_name": { "type": "string" }
}
},
"certification":{
"_parent": { "type": "user" },
"properties": {
"certification_date": { "type": "date" },
"previous_level": { "type": "string" },
"obtained_level": { "type": "string" }
}
}
}
}
added some docs:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"user","_id":1}}
{"user_name":"j.johnson"}
{"index":{"_index":"test_index","_type":"certification","_parent":1}}
{"certification_date" : "2013-02-09T00:00:00+03:00","previous_level" : "No Level","obtained_level" : "Junior"}
{"index":{"_index":"test_index","_type":"certification","_parent":1}}
{"certification_date" : "2014-05-26T00:00:00+03:00","previous_level" : "Junior","obtained_level" : "Middle"}
{"index":{"_index":"test_index","_type":"user","_id":2}}
{ "user_name":"b.bronson"}
{"index":{"_index":"test_index","_type":"certification","_parent":2}}
{"certification_date" : "2013-09-05T00:00:00+03:00","previous_level" : "No Level","obtained_level" : "Junior"}
{"index":{"_index":"test_index","_type":"certification","_parent":2}}
{"certification_date" : "2014-07-20T00:00:00+03:00","previous_level" : "Junior","obtained_level" : "Middle"}
Now I can just search certifications with a range filter:
POST /test_index/certification/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"certification_date": {
"gte": "2014-01-01"
}
}
}
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "certification",
"_id": "QGXHp7JZTeafWYzb_1FZiA",
"_score": 1,
"_source": {
"certification_date": "2014-05-26T00:00:00+03:00",
"previous_level": "Junior",
"obtained_level": "Middle"
}
},
{
"_index": "test_index",
"_type": "certification",
"_id": "yvO2A9JaTieI5VHVRikDfg",
"_score": 1,
"_source": {
"certification_date": "2014-07-20T00:00:00+03:00",
"previous_level": "Junior",
"obtained_level": "Middle"
}
}
]
}
}
This structure is still not completely flat the way you asked for, but I think this is as close as ES will let you get.
Here is the code I used:
http://sense.qbox.io/gist/3c733ec75e6c0856fa2772cc8f67bd7c00aba637
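As a side note, the _parent-style mapping above is from older Elasticsearch versions; since mapping types and _parent were removed, current versions model parent/child with a join field instead. A rough sketch of an equivalent mapping (the join field name my_join_field is my own choice):
PUT test_index
{
  "mappings": {
    "properties": {
      "user_name": { "type": "keyword" },
      "certification_date": { "type": "date" },
      "previous_level": { "type": "keyword" },
      "obtained_level": { "type": "keyword" },
      "my_join_field": {
        "type": "join",
        "relations": { "user": "certification" }
      }
    }
  }
}
Certifications are then indexed with my_join_field set to { "name": "certification", "parent": "<user id>" } and must be routed to the parent's shard via the routing parameter.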
