Elasticsearch aggregation with unique counting

My documents consist of a history of orders and their state, here a minimal example:
{
"orderNumber" : "xyz",
"state" : "shipping",
"day" : "2022-07-20",
"timestamp" : "2022-07-20T15:06:44.290Z",
}
The state can be strings like shipping, processing, redo, and so on.
For every possible state, I need to count the number of orders that had this state at some point during a day, without counting a state twice for the same orderNumber on that day (which can happen if there is a problem and the order needs to start from the beginning that same day).
My aggregation looks like this:
GET order-history/_search
{
"aggs": {
"countDays": {
"terms": {
"field": "day",
"order": {
"_key": "desc"
},
"size": 20
},
"aggs": {
"countStates": {
"terms": {
"field": "state.keyword",
"size": 10
}
}
}
}
},
"size": 1
}
However, this will count a state twice for a given orderNumber if it reappears that same day. How would I prevent it from counting a state twice for the same orderNumber on the same day?

TL;DR
I don't think there is a flexible and simple solution. But if you know in advance which states exist (perhaps obtained from another aggregation query that lists all state values; a sketch of such a query appears after the results below), you could do the following:
POST /_bulk
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"shipping","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"redo","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"shipping","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"bbb","state":"processing","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"bbb","state":"shipping","day":"2022-07-20"}
GET 73138766/_search
{
"size": 0,
"aggs": {
"per_day": {
"date_histogram": {
"field": "day",
"calendar_interval": "day"
},
"aggs": {
"shipping": {
"filter": { "term": { "state.keyword": "shipping" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
},
"processing": {
"filter": { "term": { "state.keyword": "processing" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
},
"redo": {
"filter": { "term": { "state.keyword": "redo" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
}
}
}
}
}
You will obtain the following results
{
"aggregations": {
"per_day": {
"buckets": [
{
"key_as_string": "2022-07-20T00:00:00.000Z",
"key": 1658275200000,
"doc_count": 5,
"shipping": {
"doc_count": 3,
"orders": {
"value": 2
}
},
"processing": {
"doc_count": 1,
"orders": {
"value": 1
}
},
"redo": {
"doc_count": 1,
"orders": {
"value": 1
}
}
}
]
}
}
}
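If you do not know the existing states up front, a separate terms aggregation can list the distinct values first. Here is a minimal sketch against the same test index, assuming the same state.keyword sub-field as above:
GET 73138766/_search
{
"size": 0,
"aggs": {
"all_states": {
// list every distinct state value seen in the index
"terms": {
"field": "state.keyword",
"size": 100
}
}
}
}
Each returned bucket key is a state name, and you can then generate one filter/cardinality pair per state for the query above.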

Related

How to get the last Elasticsearch document for each unique value of a field?

I have a data structure in Elasticsearch that looks like:
{
"name": "abc",
"date": "2022-10-08T21:30:40.000Z",
"rank": 3
}
I want to get, for each unique name, the rank of the document (or the whole document) with the most recent date.
I currently have this:
"aggs": {
"group-by-name": {
"terms": {
"field": "name"
},
"aggs": {
"max-date": {
"max": {
"field": "date"
}
}
}
}
}
How can I get the rank (or the whole document) for each result, and if possible, in one request?
You can use one of the options below.
Collapse
"collapse": {
"field": "name"
},
"sort": [
{
"date": {
"order": "desc"
}
}
]
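For reference, a complete request using collapse could look like the sketch below; the index name is a placeholder, and it assumes name is mapped as a keyword (collapsing requires a keyword or numeric field with doc values):
GET my-index/_search
{
"collapse": {
"field": "name"
},
"sort": [
{
"date": {
"order": "desc"
}
}
]
}
Collapse returns one hit per distinct name, and with the descending date sort that hit is the most recent document for each name.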
Top hits aggregation
{
"aggs": {
"group-by-name": {
"terms": {
"field": "name",
"size": 100
},
"aggs": {
"top_doc": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
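If you only need the rank rather than the whole document, you can restrict the _source returned by top_hits. A sketch reusing the aggregation above (the index name is a placeholder):
GET my-index/_search
{
"size": 0,
"aggs": {
"group-by-name": {
"terms": {
"field": "name",
"size": 100
},
"aggs": {
"top_doc": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"rank",
"date"
]
},
"size": 1
}
}
}
}
}
}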

Is it possible to fetch the count of the total number of docs that satisfy a qualifying aggregation condition in Elasticsearch?

I use ES v7.3. As per my requirements, I am aggregating some fields to fetch the required docs in the response. Further, there is a requirement to also fetch the total count of all docs that contain the nested field qualifying the aggregation condition described below, but I have not found a way to do that.
The aggregation query that I am currently using to fetch the documents is:
"aggs": {
"users": {
"composite": {
"sources": [
{
"users": {
"terms": {
"field": "co_profileId.keyword"
}
}
}
],
"size": 5000
},
"aggs": {
"sessions": {
"nested": {
"path": "co_score"
},
"aggs": {
"last_4_days": {
"filter": {
"range": {
"co_score.sessionTime": {
"gte": "2021-01-10T00:00:31.399Z",
"lte": "2021-01-14T01:37:31.399Z"
}
}
},
"aggs": {
"score_count": {
"sum": {
"field": "co_score.value"
}
}
}
}
}
},
"page_view_count_filter": {
"bucket_selector": {
"buckets_path": {
"sessionCount": "sessions > last_4_days > score_count"
},
"script": "params.sessionCount > 100"
}
},
"filtered_users": {
"top_hits": {
"size": 1,
"_source": {
"includes": [
"co_profileId",
"co_type",
"co_score"
]
}
}
}
}
}
}
Sample doc:
{
"co_profileId": "14654325",
"co_type": "identify",
"co_updatedAt": "2021-01-11T11:37:33.499Z",
"co_score": [
{
"value": 3,
"sessionTime": "2021-01-09T01:37:31.399Z"
},
{
"value": 3,
"sessionTime": "2021-01-10T10:47:33.419Z"
},
{
"value": 6,
"sessionTime": "2021-01-11T11:37:33.499Z"
}
]
}

Perform multi-field / multi-dimensional aggregations with nested fields in Elasticsearch

I am tracking the attendance of a few students. I am storing their details in the index like the example below.
Each doc in "entries" has a few other fields. The following data shows that a student has attended 6 classes on "Monday".
"entries" is of type "nested".
{
"reg_id": 1111,
"entries": [
{
"id": "123",
"day": "Monday"
},
{
"id": "1234",
"attendance": true
},
{
"id": "12345",
"classes_attended": 6
}
]
}
I want, for each day, the count of students for each classes_attended value.
For example: "72 entries of students found for "Monday" who have attended 6 classes".
Sample desired output (this is just a sample; I am completely fine if the output schema is changed):
[
{
"day" : "monday",
"classes_attended": 6,
"count": 4
},
{
"day" : "monday",
"classes_attended": 1,
"count": 5
},
{
"day" : "tuesday",
"classes_attended": 5,
"count": 2
},
{
"day" : "tuesday",
"classes_attended": 6,
"count": 1
}
]
Not sure how to start with the aggregations query.
I tried the following query, but I know it's not the correct solution:
"aggs": {
"attendance_aggs": {
"nested": {
"path": "entries"
},
"aggs": {
"days": {
"terms": {
"field": "entries.day"
},
"aggs": {
"attended": {
"reverse_nested": {},
"aggs":{
"class_attended_day": {
"terms": {
"field": "entries.classes_attended"
},
"aggs": {
"class_attended_days_count": {
"reverse_nested": {},
"aggs": {
"classes_attended_final": {
"cardinality": {
"field": "entries.class_attended"
}
}
}
}
}
}
}
}
}
}
}
}
}
It's unclear what the top-level object is so I'm going to assume it's a "student attendance entry per day". I'm also unsure what the entries.ids represent but I'll assume you'll be needing them at some later point so I'll keep them untouched.
Now, since all that your entries objects have in common is the id, they can be decoupled. Meaning that you should be using nested if and only if you share some attributes across all objects which need their attribute connections preserved. Since I don't see entries.id anywhere in your aggs, I'd recommend the following adjustments to your mapping:
PUT students
{
"mappings": {
"properties": {
"day": { ------------
"type": "keyword" |
}, |
"attendance": { |
"type": "boolean" | <--
}, |
"classes_attended": { |
"type": "integer" |
}, ------------
"entries": {
"type": "nested",
"properties": {
"day": {
"type": "keyword",
"copy_to": "day" <--
},
"attendance": {
"type": "boolean",
"copy_to": "attendance" <--
},
"classes_attended": {
"type": "integer",
"copy_to": "classes_attended" <--
}
}
}
}
}
}
and here's your query:
GET students/_search
{
"size": 0,
"aggs": {
"days": {
"terms": {
"field": "day"
},
"aggs": {
"classes_attended": {
"terms": {
"field": "classes_attended"
},
"aggs": {
"student_count": {
"cardinality": {
"field": "_id"
}
}
}
}
}
}
}
}
The response can then be post-processed into whatever you prefer.
EDIT
You could hijack reverse_nested, but you will then need to step back into the nested context because you're referencing other nested entries:
GET students/_search
{
"size": 0,
"aggs": {
"attendance_aggs": {
"nested": {
"path": "entries"
},
"aggs": {
"days": {
"terms": {
"field": "entries.day"
},
"aggs": {
"attended": {
"reverse_nested": {},
"aggs": {
"class_attended_day": {
"nested": {
"path": "entries"
},
"aggs": {
"class_attended_day": {
"terms": {
"field": "entries.classes_attended"
},
"aggs": {
"classes_attended_final": {
"cardinality": {
"field": "entries.classes_attended"
}
}
}
}
}
}
}
}
}
}
}
}
}
}

Filter based on different values for the same field in different documents

Let's say I have the following data:
{
"id":"1",
"name": "John",
"tag":"x"
},
{
"id": 2,
"name":"John",
"tag":"y"
},
{
"id": 3,
"name":"Jane",
"tag":"x"
}
I want to get the count of documents (unique on name) that have both tag = "x" and tag = "y".
Given the above data, the query should return 1, because only John has two documents with the two required tags.
What I am able to do so far is a query that uses OR (so either tag = "x" or tag = "y"), which will return 2. For example:
"aggs": {
"distict_count": {
"filter": {
"terms": {
"tag": [
"x",
"y"
]
}
},
"aggs": {
"agg_cardinality_name": {
"cardinality": {
"field": "name"
}
}
}
}
}
Would it be possible to change that to use and instead of or?
Try putting cardinality under a terms agg to get proper distinct counts:
{
"size": 0,
"aggs": {
"distict_count": {
"filter": {
"terms": {
"tag": [
"x",
"y"
]
}
},
"aggs": {
"agg_terms": {
"terms": {
"field": "name"
},
"aggs": {
"agg_cardinality_name": {
"cardinality": {
"field": "name"
}
}
}
}
}
}
}
}
CORRECTION
You can use a combination of cardinality aggs with a bucket_selector, which will rule out buckets where there are fewer than 2 unique tags, i.e. buckets that do not have both x and y:
{
"size": 0,
"aggs": {
"distict_count": {
"filter": {
"terms": {
"tag": [
"x",
"y"
]
}
},
"aggs": {
"agg_terms": {
"terms": {
"field": "name"
},
"aggs": {
"agg_cardinality_tag2": {
"bucket_selector": {
"buckets_path": {
"unique_tags_count": "unique_tags_count"
},
"script": "params.unique_tags_count > 1"
}
},
"unique_tags_count": {
"cardinality": {
"field": "tag"
}
},
"unique_names_count": {
"cardinality": {
"field": "name"
}
}
}
}
}
}
}
}

Subtract numeric fields between two documents with different timestamp

Let's say I have these data samples:
{
"date": "2019-06-16",
"rank": 150
"name": "doc 1"
}
{
"date": "2019-07-16",
"rank": 100
"name": "doc 1"
}
{
"date": "2019-06-16",
"rank": 50
"name": "doc 2"
}
{
"date": "2019-07-16",
"rank": 80
"name": "doc 2"
}
The expected result is obtained by subtracting the rank fields of two docs with the same name but different dates (old date minus new date):
{
"name": "doc 1",
"diff_rank": 50
}
{
"name": "doc 2",
"diff_rank": -30
}
And sort by diff_rank if possible; otherwise I will just sort manually after getting the result.
What I have tried is using date_histogram and serial_diff, but some results are somehow missing the diff_rank value even though I am sure the data exists:
{
"aggs" : {
"group_by_name": {
"terms": {
"field": "name"
},
"aggs": {
"days": {
"date_histogram": {
"field": "date",
"interval": "day"
},
"aggs": {
"the_rank": {
"sum": {
"field": "rank"
}
},
"diff_rank": {
"serial_diff": {
"buckets_path": "the_rank",
"lag" : 30 // 1 month or 30 days in this case
}
}
}
}
}
}
}
}
Any help in solving my issue above will be much appreciated!
Finally, I found a method in the official docs using the Filter aggregation, Bucket Script aggregation, and Bucket Sort (to sort the result). Here is the final snippet:
{
"size": 0,
"aggs" : {
"group_by_name": {
"terms": {
"field": "name",
"size": 50,
"shard_size": 10000
},
"aggs": {
"last_month_rank": {
"filter": {
"term": {"date": "2019-06-17"}
},
"aggs": {
"rank": {
"sum": {
"field": "rank"
}
}
}
},
"latest_rank": {
"filter": {
"term": {"date": "2019-07-17"}
},
"aggs": {
"rank": {
"sum": {
"field": "rank"
}
}
}
},
"diff_rank": {
"bucket_script": {
"buckets_path": {
"lastMonthRank": "last_month_rank>rank",
"latestRank": "latest_rank>rank"
},
"script": "params.lastMonthRank - params.latestRank"
}
},
"rank_bucket_sort": {
"bucket_sort": {
"sort": [
{"diff_rank": {"order": "desc"}}
],
"size": 50
}
}
}
}
}
}
