Elasticsearch: How to make an aggregation pipeline?

Imagine the following use case:
We work at Stark Airlines and our marketing team wants to segment our passengers in order to give them discounts or gift cards. They decide that they want two sets of passengers:
Passengers who fly at least 3 times per week
Passengers who have flown at least once but have not flown for two weeks
With this they can make different marketing campaigns for our passengers!
So, in Elasticsearch we have a trip index where each document represents a ticket bought by a passenger:
{
"_index" : "trip",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"total_amount" : 300,
"trip_date" : "2020/03/24 13:30:00",
"status" : "completed",
"passenger" : {
"id" : 11,
"name" : "Thiago nunes"
}
}
}
The trip index contains a status field that may have other values, like pending, open, or canceled.
This means that we can only take into account trips that have the completed status (meaning the passenger actually traveled).
So, with all this in mind... how would I get those two sets of passengers with Elasticsearch?
I have been trying for a while but with no success.
What I have done until now:
I have built a query that gets all valid trips (trips with status completed):
GET /trip/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"status": {
"value": "completed"
}
}
}
]
}
},
"aggs": {
"status_viagem": {
"terms": {
"field": "status.keyword"
}
}
}
}
This query returns the following:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 200,
"relation" : "eq"
},
"max_score" : 0.18232156,
"hits" : [...]
},
"aggregations" : {
"status_viagem" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "completed",
"doc_count" : 200
}
]
}
}
}
But I am stuck and can't figure out the next step. I know that the next thing to do should be to create buckets of passengers and then filter them into two buckets representing our desired data sets, but I don't know how.
Can someone help?
PS.:
I don't exactly need this to be one single query; just a hint about how to build a query like this would be very helpful.
THE OUTPUT SHOULD BE AN ARRAY of passenger ids.
Note: I have shortened the trip index for the sake of simplicity

As per my understanding of your issue:
I have used a date_histogram with interval set to week to group passengers by week. Within each week, only passengers with at least three documents are kept. This gives you all passengers who have traveled three times in a week.
In another aggregation I have used a terms aggregation to get passengers and their last travel date. Using a bucket_selector, I have kept only the passengers whose last travel is not after a certain cutoff date.
Mapping
{
"index87" : {
"mappings" : {
"properties" : {
"passengerid" : {
"type" : "long"
},
"passengername" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"status" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"total_amount" : {
"type" : "long"
},
"trip_date" : {
"type" : "date"
}
}
}
}
}
Query
{
"query": {
"bool": {
"must": [
{
"term": {
"status": {
"value": "completed"
}
}
}
]
}
},
"aggs": {
"travel_thrice_week": {
"date_histogram": {
"field": "trip_date",
"interval": "week"
},
"aggs": {
"passenger": {
"terms": {
"field": "passengername.keyword",
"min_doc_count": 3,
"size": 10
}
},
"select_bucket_with_user": {-->to keep weeks which have a pasenger with thrice
--> a day travel
"bucket_selector": {
"buckets_path": {
"passenger": "passenger._bucket_count"
},
"script": "if(params['passenger']>=1) {return true;} else{ return false;} "
}
}
}
},
"not_flown_last_two_week": {
"terms": {
"field": "passengername.keyword",
"size": 10
},
"aggs": {
"last_travel": {
"max": {
"field": "trip_date" --> most recent travel
}
},
"last_travel_before_two_week": {
"bucket_selector": {
"buckets_path": {
"traveldate": "last_travel"
},
"script":{
"source": "if(params['traveldate']< params['date_epoch']) return true; else return false;",
"params": {
"date_epoch":1586408336000 --> unix epoc of cutt off date
}
}
}
}
}
}
}
}
Result:
"aggregations" : {
"not_flown_last_two_week" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Thiago nunes",
"doc_count" : 3,
"last_travel" : {
"value" : 1.5851808E12,
"value_as_string" : "2020-03-26T00:00:00.000Z"
}
},
{
"key" : "john doe",
"doc_count" : 1,
"last_travel" : {
"value" : 1.5799968E12,
"value_as_string" : "2020-01-26T00:00:00.000Z"
}
}
]
},
"travel_thrice_week" : {
"buckets" : [
{
"key_as_string" : "2020-03-23T00:00:00.000Z",
"key" : 1584921600000,
"doc_count" : 3,
"passenger" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Thiago nunes",
"doc_count" : 3
}
]
}
}
]
}
}
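Since the question asks for an array of passenger ids rather than passenger names, the same aggregations can be pointed at the passengerid field from the mapping above. A minimal sketch of the second aggregation with that change (the cutoff epoch is the same placeholder value as before):
"not_flown_last_two_week": {
  "terms": {
    "field": "passengerid",
    "size": 10
  },
  "aggs": {
    "last_travel": {
      "max": { "field": "trip_date" }
    },
    "last_travel_before_two_week": {
      "bucket_selector": {
        "buckets_path": { "traveldate": "last_travel" },
        "script": {
          "source": "params['traveldate'] < params['date_epoch']",
          "params": { "date_epoch": 1586408336000 }
        }
      }
    }
  }
}
The bucket keys in the response are then the passenger ids, which the client can collect into the required array; the passenger terms aggregation inside travel_thrice_week can be switched to passengerid in the same way.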

Related

Can Elasticsearch do aggregations within a document?

I have a mapping like this:
mappings: {
"seller": {
"properties" : {
"overallRating": {"type" : "byte"},
"items": [
{
"itemName": {"type": "string"},
"itemRating": {"type" : "byte"}
}
]
}
}
}
Each item will only have one itemRating. Each seller will only have one overall rating. There can be many items, and at most I'm expecting maybe 50 items with itemRatings. Not all items have to have an itemRating.
I'm trying to get an average rating for each seller that combines all itemRatings and the overallRating. I have looked into aggregations, but all I have seen are aggregations across all documents. The aggregation I'm looking to do is within the document itself, and I am not sure if that is possible. Any tips would be appreciated.
Yes, this is very much possible with Elasticsearch. To produce a combined rating, you simply need to sub-aggregate by the document id. The only thing present in each bucket would then be the individual document, which is what you want.
Here is an example:
Create the index:
PUT /ratings
{
"mappings": {
"properties": {
"overallRating": {"type" : "float"},
"items": {
"type" : "nested",
"properties": {
"itemName" : {"type" : "keyword"},
"itemRating" : {"type" : "float"},
"overallRating": {"type" : "float"}
}
}
}
}
}
Add some data:
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "labrador",
"itemRating" : 10,
"overallRating" : 1
},
{
"itemName" : "saint bernard",
"itemRating" : 20,
"overallRating" : 1
}
]
}
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "cat",
"itemRating" : 5,
"overallRating" : 1
},
{
"itemName" : "rat",
"itemRating" : 10,
"overallRating" : 1
}
]
}
Query the index for a combined rating per document:
GET ratings/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_rating": {
"composite": {
"sources": [
{
"ids": {
"terms": {
"field": "_id"
}
}
}
]
},
"aggs": {
"average_rating": {
"nested": {
"path": "items"
},
"aggs": {
"avg": {
"avg": {
"field": "items.compound"
}
}
}
}
}
}
},
"runtime_mappings": {
"items.compound": {
"type": "double",
"script": {
"source": "emit(doc['items.overallRating'].value + doc['items.itemRating'].value)"
}
}
}
}
The result (please note that I changed the exact rating values between writing the answer and running it in the console, so the averages are a bit different):
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_rating" : {
"after_key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"buckets" : [
{
"key" : {
"ids" : "3_Up44EBbR3hrRYkLsrC"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 151.0
}
}
},
{
"key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 8.5
}
}
}
]
}
}
}
One change for convenience:
I edited your mappings to add the overallRating to each item entry. This simplifies the calculations that follow, because you only ever look inside the nested scope and never have to step out of it.
I also used a runtime mapping to combine the values of overallRating and itemRating into a better average: I summed each itemRating with the overallRating and averaged those sums across every entry.
Finally, I used a top-level composite aggregation on _id so that we only get results per document (which is what you want).
There is some pretty heavy lifting happening here, but it is very possible and easy to adapt as you require.
HTH.

How to return the hit terms in ES?

I am trying to return only the terms that were successfully hit instead of the document itself, but I don't know how to achieve the desired effect.
"es_episode" : {
"aliases" : { },
"mappings" : {
"properties" : {
"endTime" : {
"type" : "long"
},
"episodeId" : {
"type" : "long"
},
"startTime" : {
"type" : "long"
},
"studentIds" : {
"type" : "long"
}
}
}
This is an example:
{
"episodeId":124,
"startTime":10,
"endTime":20,
"studentIds":[200,300]
}
My query:
GET /es_episode/_search
{
"_source": ["studentIds"],
"query": {
"terms": {
"studentIds": [300,400]
}
}
}
The result is
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "es_episode",
"_type" : "episode",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"studentIds" : [
200,
300
]
}
}
]
}
But in fact I only want to know which terms hit. For example, the result I want should be studentIds=[300] instead of all of the returned document's studentIds=[200,300]. It seems that some additional operation is required, but I don't know how.
I tried to achieve my goal with the following query:
GET /es_episode/_search
{
"_source": ["studentIds"],
"query": {
"terms": {
"studentIds": [300,400]
}
},
"aggs": {
"student_id": {
"terms": {
"field": "studentIds",
"size": 10
},
"aggs": {
"id": {
"terms": {
"field": "episodeId"
}
},
"id_select":{
"bucket_selector": {
"buckets_path": {
"key" : "_key"
},
"script": "params.key==300 || params.key==400"
}
}
}
}
}
}
The result for this is:
"aggregations" : {
"student_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 300,
"doc_count" : 1,
"id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 124,
"doc_count" : 1
}
]
}
}
]
}
}
It seems that I successfully filtered out the terms I don't want, but this doesn't look pretty, and I have to repeat my parameters in the script.
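One alternative worth considering (not from the original attempt, just a sketch using Elasticsearch named queries): split the terms query into individually named term clauses, and each hit will then report which clauses matched via matched_queries, with no aggregation or script needed.
GET /es_episode/_search
{
  "_source": false,
  "query": {
    "bool": {
      "should": [
        { "term": { "studentIds": { "value": 300, "_name": "student_300" } } },
        { "term": { "studentIds": { "value": 400, "_name": "student_400" } } }
      ],
      "minimum_should_match": 1
    }
  }
}
Each hit then carries a matched_queries array such as ["student_300"], from which the matching ids can be read off without repeating the values anywhere else.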

Elasticsearch aggregation on different search in same query

I want to make a query whose aggregations are based only on the match clause, no matter what other parameters (terms, term, etc.) are used.
To be more specific, I have an online shop where I use multiple filters (color, size, etc.). If I check a field, for example color: red, the other colors are no longer aggregated.
A solution that I am using is to make 2 separate queries (one for the search where filters are applied and another for the aggregations). Any idea how I can combine the 2 separate queries?
You can take advantage of post_filter which will not apply to your aggregations but will only filter the to-be-returned hits. For example:
Create a shop
PUT online_shop
{
"mappings": {
"properties": {
"color": {
"type": "keyword"
},
"size": {
"type": "integer"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Populate it w/ a few products
POST online_shop/_doc
{"color":"red","size":35,"name":"Louboutin High heels abc"}
POST online_shop/_doc
{"color":"black","size":34,"name":"Louboutin Boots abc"}
POST online_shop/_doc
{"color":"yellow","size":36,"name":"XYZ abc"}
Apply a shared query to the hits as well as aggregations and use post_filter to ... post-filter the hits:
GET online_shop/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
}
]
}
},
"aggs": {
"by_color": {
"terms": {
"field": "color"
}
},
"by_size": {
"terms": {
"field": "size"
}
}
},
"post_filter": {
"bool": {
"must": [
{
"term": {
"color": {
"value": "red"
}
}
}
]
}
}
}
Expected result
{
...
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.11750763,
"hits" : [
{
"_index" : "online_shop",
"_type" : "_doc",
"_id" : "cehma3IBG_KW3EFn1QYa",
"_score" : 0.11750763,
"_source" : {
"color" : "red",
"size" : 35,
"name" : "Louboutin High heels abc"
}
}
]
},
"aggregations" : {
"by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "black",
"doc_count" : 1
},
{
"key" : "red",
"doc_count" : 1
},
{
"key" : "yellow",
"doc_count" : 1
}
]
},
"by_size" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 34,
"doc_count" : 1
},
{
"key" : 35,
"doc_count" : 1
},
{
"key" : 36,
"doc_count" : 1
}
]
}
}
}
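A possible variation, in case the size facet should reflect the chosen color while the color facet stays unfiltered (the usual multi-select faceting pattern): apply the color filter as a filter sub-aggregation around the size terms only, instead of relying on post_filter for that part. A sketch against the same index (the aggregation name sizes_for_selected_color is made up for illustration):
GET online_shop/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "abc" } }
      ]
    }
  },
  "aggs": {
    "by_color": {
      "terms": { "field": "color" }
    },
    "sizes_for_selected_color": {
      "filter": { "term": { "color": "red" } },
      "aggs": {
        "by_size": { "terms": { "field": "size" } }
      }
    }
  },
  "post_filter": {
    "term": { "color": "red" }
  }
}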

How to select the last bucket in a date_histogram selector in Elasticsearch

I have a date_histogram and I can use max_bucket to get the bucket with the greatest value, but I want to select the last bucket (i.e. the bucket with the highest timestamp).
Using max_bucket to get the greatest value works OK, but I don't know what to put in the buckets_path to get the last bucket.
My mapping:
{
"ee-2020-02-28" : {
"mappings" : {
"dynamic" : "strict",
"properties" : {
"date" : {
"type" : "date"
},
"frequency" : {
"type" : "long"
},
"keyword" : {
"type" : "keyword"
},
"text" : {
"type" : "text"
}
}
}
}
}
My working query, which returns the bucket for the day with the highest frequency (it's named last_day because this is a WIP query to get to my goal):
{
"query": {
"range": {
"date": { /* Start away from the begining of data, so the rolling avg is full */
"gte": "2019-02-18"/*,
"lte": "2020-12-14"*/
}
}
},
"aggs": {
"palabrejas": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"nnndiario": {
"date_histogram": {
"field": "date",
"calendar_interval": "day"
},
"aggs": {
"dailyfreq": {
"sum": {
"field": "frequency"
}
}
}
},
"ventanuco": {
"avg_bucket": {
"buckets_path": "nnndiario>dailyfreq",
"gap_policy": "insert_zeros"
}
},
"last_day": {
"max_bucket": {
"buckets_path": "nnndiario>dailyfreq"
}
}
}
}
}
}
Its output (notice I replaced long parts with [...]):
{
"aggregations" : {
"palabrejas" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "rama0",
"doc_count" : 20400,
"nnndiario" : {
"buckets" : [
{
"key_as_string" : "2020-01-01T00:00:00.000Z",
"key" : 1577836800000,
"doc_count" : 600,
"dailyfreq" : {
"value" : 3000.0
}
},
{
"key_as_string" : "2020-01-02T00:00:00.000Z",
"key" : 1577923200000,
"doc_count" : 600,
"dailyfreq" : {
"value" : 3000.0
}
},
{
"key_as_string" : "2020-01-03T00:00:00.000Z",
"key" : 1578009600000,
"doc_count" : 600,
"dailyfreq" : {
"value" : 3000.0
}
},
[...]
{
"key_as_string" : "2020-01-31T00:00:00.000Z",
"key" : 1580428800000,
"doc_count" : 600,
"dailyfreq" : {
"value" : 3000.0
}
}
]
},
"ventanuco" : {
"value" : 3290.3225806451615
},
"last_day" : {
"value" : 12000.0,
"keys" : [
"2020-01-13T00:00:00.000Z"
]
}
},
{
"key" : "rama1",
"doc_count" : 20400,
"nnndiario" : {
"buckets" : [
{
"key_as_string" : "2020-01-01T00:00:00.000Z",
"key" : 1577836800000,
"doc_count" : 600,
"dailyfreq" : {
"value" : 3000.0
}
},
[...]
]
},
"ventanuco" : {
"value" : 3290.3225806451615
},
"last_day" : {
"value" : 12000.0,
"keys" : [
"2020-01-13T00:00:00.000Z"
]
}
},
[...]
}
]
}
}
}
I don't know what to put in last_day's buckets_path to obtain the last bucket.
You might consider using a terms aggregation instead of a date_histogram aggregation:
"max_date_bucket_agg": {
"terms": {
"field": "date",
"size": 1,
"order": {"_key": "desc"}
}
}
An issue might be the granularity of your data; you may consider storing the date value at the expected granularity (e.g. day) in a separate field and using that field in the terms aggregation, as sketched below.
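A minimal sketch of how that terms aggregation could slot into the original query in place of max_bucket, keeping the summed frequency for the most recent day (field names are taken from the question; a separate day-granularity field would be used the same way):
"last_day": {
  "terms": {
    "field": "date",
    "size": 1,
    "order": { "_key": "desc" }
  },
  "aggs": {
    "dailyfreq": {
      "sum": { "field": "frequency" }
    }
  }
}
The single bucket returned under each keyword is then its latest date bucket, with the summed frequency attached.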

Aggregation on .keyword to return only the keys that contain a specific string

New to aggregations in Elasticsearch. Using 7.2. I am trying to write an aggregation on Tree.keyword to only return the count of documents that have a key that contains the word "Branch". I have tried sub-aggregations, bucket_selector (which doesn't work for key strings) and scripts. Anyone have any ideas or suggestions on how to approach this?
Mapping:
{
"testindex" : {
"mappings" : {
"properties" : {
"Tree" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
}
Example query that returns all the keys; what I need to do is limit it to only return keys with "Branch", or better yet just the count of how many "Branch" keys there are:
GET testindex/_search
{
"aggs": {
"bucket": {
"terms": {
"field": "Tree.keyword"
}
}
}
}
Returns:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "testindex",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"Tree" : [
"Car:76",
"Branch:yellow",
"Car:one",
"Branch:blue"
]
}
}
]
},
"aggregations" : {
"bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Car:76",
"doc_count" : 1
},
{
"key" : "Branch:yellow",
"doc_count" : 1
},
{
"key" : "Car:one",
"doc_count" : 1
},
{
"key" : "Branch:blue",
"doc_count" : 1
}
]
}
}
}
You have to add includes to limit the result. Here's a code sample that hopefully helps you.
GET testindex/_search
{
"_source": {
"includes": [
"Branch"
]
},
"aggs": {
"bucket": {
"terms": {
"field": "Tree.keyword"
}
}
}
}
It is possible to filter the values for which buckets will be created. This can be done using the include and exclude parameters, which are based on regular expression strings or arrays of exact values. Additionally, include clauses can filter using partition expressions.
For your case, it should be like this (note that the include pattern is a regular expression that must match the whole term):
GET testindex/_search
{
"aggs": {
"bucket": {
"terms": {
"field": "Tree.keyword",
"include": "Branch:*"
}
}
}
}
Thanks for all the help! Unfortunately, none of those solutions worked for me. I ended up using a script to return all the branches and then putting everything else into a new key. Then I used a bucket script to subtract 1 to get Total_Branches. There is probably a better solution out there, but hopefully it helps someone.
GET testindex/_search
{
"aggs": {
"bucket": {
"cardinality": {
"field": "Tree.keyword",
"script": {
"lang": "painless",
"source": "if(_value.contains('Branches:')) { return _value} return 1;"
}
}
},
"Total_Branches": {
"bucket_script": {
"buckets_path": {
"my_var1": "bucket.value"
},
"script": "return params.my_var1-1"
}
}
}
}
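For reference, if only a single number is needed (how many distinct "Branch" keys exist), one hedged option is to pair the include filter shown earlier with a sibling stats_bucket pipeline aggregation, whose count field equals the number of matching buckets. The size of 1000 is an assumption and should be set high enough to cover all expected keys:
GET testindex/_search
{
  "size": 0,
  "aggs": {
    "branch_keys": {
      "terms": {
        "field": "Tree.keyword",
        "include": "Branch:.*",
        "size": 1000
      }
    },
    "branch_key_count": {
      "stats_bucket": {
        "buckets_path": "branch_keys>_count"
      }
    }
  }
}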
