The following are sample documents in Elasticsearch.
{
"_index": "social",
"_type": "social",
"_id": "1632560884596186633",
"_score": 1,
"_source": {
"created_date": "2017-10-24",
"reach": 1692,
"social_id": 200
}
},
{
"_index": "social",
"_type": "social",
"_id": "1626693964184981799",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 100
}
},
{
"_index": "social",
"_type": "social",
"_id": "162669396418498170",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 50
}
},
{
"_index": "social",
"_type": "social",
"_id": "1626693964184981756",
"_score": 1,
"_source": {
"created_date": "2017-10-25",
"reach": 1692,
"social_id": 25
}
}
Question: How can I sum the reach of the top 2 documents, ordered by created date, for each social id?
What I have tried:
{
"size": 0,
"aggs": {
"reach_bucket": {
"terms": {
"size": 200,
"field": "social_id"
},
"aggs": {
"media_reach_bucket": {
"terms": {
"field": "created_date",
"size": 200
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"created_date": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"created_date",
"reach"
]
},
"size": 2
}
}
}
}
}
}
}
}
Problem:
I am not able to add a sub-aggregation under top_hits.
Any suggestions would be appreciated.
You may want to use date_histogram instead of terms when bucketing per day (I assume). But more importantly, you should sort your top_hits by reach, not created_date, since that's going to be the same in your per-day bucket.
{
"size": 0,
"aggs": {
"reach_bucket": {
"terms": {
"size": 200,
"field": "social_id"
},
"aggs": {
"media_reach_bucket": {
"date_histogram": {
"field": "created_date",
"calendar_interval": "day"
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"reach": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"reach"
]
},
"size": 2
}
}
}
}
}
}
}
}
yielding top hits like so
"aggregations" : {
"reach_bucket" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 100,
"doc_count" : 4,
"media_reach_bucket" : {
"buckets" : [
{
"key_as_string" : "2017-10-24T00:00:00.000Z",
"key" : 1508803200000,
"doc_count" : 4,
"top_sales_hits" : {
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "kart",
"_type" : "_doc",
"_id" : "3iLJRnEBZbobBB0NiV8R",
"_score" : null,
"_source" : {
"reach" : 40
},
"sort" : [
40
]
},
{
"_index" : "kart",
"_type" : "_doc",
"_id" : "3SLJRnEBZbobBB0Nhl-Y",
"_score" : null,
"_source" : {
"reach" : 30
},
"sort" : [
30
]
}
]
}
}
}
]
}
}
]
}
}
whose reach you can then sum up in your post processing functions.
I'm not familiar with a top-n sum, only with sums of docs above a certain threshold -- in which case I'd use filter aggregations.
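The post-processing step mentioned above could look roughly like the following Python sketch. It assumes the response has already been parsed into a dict and that the aggregation names (reach_bucket, media_reach_bucket, top_sales_hits) match the query above; the function name sum_top_hit_reach is made up for illustration.

```python
def sum_top_hit_reach(aggregations):
    """Sum `reach` over the top hits of every (social_id, day) bucket."""
    totals = {}
    for social_bucket in aggregations["reach_bucket"]["buckets"]:
        for day_bucket in social_bucket["media_reach_bucket"]["buckets"]:
            hits = day_bucket["top_sales_hits"]["hits"]["hits"]
            key = (social_bucket["key"], day_bucket["key_as_string"])
            totals[key] = sum(hit["_source"]["reach"] for hit in hits)
    return totals


# Trimmed copy of the "aggregations" object shown above.
sample = {
    "reach_bucket": {"buckets": [{
        "key": 100,
        "media_reach_bucket": {"buckets": [{
            "key_as_string": "2017-10-24T00:00:00.000Z",
            "top_sales_hits": {"hits": {"hits": [
                {"_source": {"reach": 40}},
                {"_source": {"reach": 30}},
            ]}},
        }]},
    }]}
}

print(sum_top_hit_reach(sample))
# {(100, '2017-10-24T00:00:00.000Z'): 70}
```

Because the top_hits size is 2, each total covers at most two documents per bucket.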
I've prepared an Elasticsearch query in which I'm trying to fetch results from nested objects. The query looks something like this:
{
"from": 0,
"size": 100,
"_source": {
"excludes": [
"#version"
]
},
"query": {
"bool": {
"must": [
{
"term": {
"doc.workflow_id.keyword": "workflow1"
}
},
{
"nested": {
"path": "doc.attributes",
"query": {
"bool": {
"filter": [
{
"match": {
"doc.attributes.name": "color"
}
},
{
"bool": {
"should": [
{
"wildcard": {
"doc.attributes.value.rawold": "*green*"
}
}
]
}
}
]
}
}
}
},
{
"nested": {
"path": "doc.attributes",
"query": {
"bool": {
"filter": [
{
"match": {
"doc.attributes.name": "price"
}
},
{
"bool": {
"should": [
{
"wildcard": {
"doc.attributes.value.rawold": "*34*"
}
}
]
}
}
]
}
}
}
}
],
"must_not": []
}
}
}
Output:
"hits" : [
{
"_index" : "sample_index",
"_type" : "_doc",
"_id" : "mv1",
"_score" : null,
"_source" : {
"doc" : {
"workflow_id" : "workflow1",
"attributes" : [
{
"name" : "price",
"value" : "34"
},
{
"name" : "weight",
"value" : "10"
},
{
"name" : "color",
"value" : "green"
},
{
"name" : "city",
"value" : "#error"
}
]
}
}
},
{
"_index" : "sample_index",
"_type" : "_doc",
"_id" : "mv2",
"_score" : null,
"_source" : {
"doc" : {
"workflow_id" : "workflow1",
"attributes" : [
{
"name" : "price",
"value" : "34"
},
{
"name" : "color",
"value" : "green"
}
]
}
}
}
]
I've omitted a few trivial details in query and output for simplicity. The attributes array in the response is of type nested and contains name and value fields of type string.
I've put filters on attributes color and price, but as you can see, I'm getting other attributes too in the attributes array. Can I somehow pass specific attribute names to the ES query and get the value of those attributes only?
I tried using inner_hits in both nested queries, but it returns the attribute value only for the passed attribute name in the nested query.
E.g.
{
"nested": {
"path": "doc.attributes",
"query": {
"bool": {
"filter": [
{
"match": {
"doc.attributes.name": "color"
}
},
{
"bool": {
"should": [
{
"wildcard": {
"doc.attributes.value.rawold": "*green*"
}
}
]
}
}
]
}
},
"inner_hits": {
"name": "two",
"_source": [
"doc.attributes.name",
"doc.attributes.value"
]
}
}
}
gives result
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv1",
"_score": null,
"_source": {
"doc": {
"workflow_id": "workflow1",
"attributes": [
{
"name": "price",
"value": "34"
},
{
"name": "weight",
"value": "34"
},
{
"name": "color",
"value": "green"
},
{
"name": "city",
"value": "#ERROR"
}
]
}
},
"inner_hits": {
"two": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv1",
"_nested": {
"field": "doc.attributes",
"offset": 1
},
"_score": 0.0,
"_source": {
"name": "color",
"value": "green"
}
}
]
}
}
}
},
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv2",
"_score": null,
"_source": {
"doc": {
"workflow_id": "workflow1",
"attributes": [
{
"name": "price",
"value": "34"
},
{
"name": "color",
"value": "green"
}
]
}
},
"inner_hits": {
"two": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "sample_index",
"_type": "_doc",
"_id": "mv1",
"_nested": {
"field": "doc.attributes",
"offset": 1
},
"_score": 0.0,
"_source": {
"name": "color",
"value": "green"
}
}
]
}
}
}
}
]
}
Note the attribute name and value received inside the inner_hits object.
I also want to get the names and values of other attributes in the response, including those I'm not filtering on. For example, if I want to get attribute names and values for weight, color & city only, how do I do that?
I've checked the thread "select matching objects from array in elasticsearch", but it doesn't solve my problem.
I have some documents that contain the fields id, size, etc.
I want to find all the possible sizes where id = 1 or 2.
Is this possible?
You can use a terms query with source filtering. Here is a working example.
Index Data:
{
"id": 1,
"size": 1
}
{
"id": 2,
"size": 2
}
{
"id": 3,
"size": 3
}
Search Query:
{
"_source": "size",
"query": {
"terms": {
"id": [
1,
2
]
}
}
}
Search Result:
"hits": [
{
"_index": "66381642",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"size": 1
}
},
{
"_index": "66381642",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"size": 2
}
}
]
If you want to show the possible sizes for those ids, then you should use an aggregation.
POST your_index/_search
{
"size": 0,
"query": {
"terms": {
"id": [
"1",
"2"
]
}
},
"aggs": {
"sizes": {
"terms": {
"field": "size"
}
}
}
}
The response contains each unique size along with the number of docs that have that size:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"sizes" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"doc_count" : 1
},
{
"key" : 2,
"doc_count" : 1
}
]
}
}
}
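To turn that response into a plain list of distinct sizes on the client, something like the following Python sketch may be enough. It assumes the response has been parsed into a dict; the helper name distinct_values is hypothetical.

```python
def distinct_values(response, agg_name):
    """Map each distinct bucket key to its document count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Trimmed copy of the response shown above.
response = {"aggregations": {"sizes": {"buckets": [
    {"key": 1, "doc_count": 1},
    {"key": 2, "doc_count": 1},
]}}}

print(distinct_values(response, "sizes"))
# {1: 1, 2: 1}
```

Note that a terms aggregation returns only the top 10 buckets by default, so the "size" of the aggregation may need to be raised when there are many distinct values.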
I have an index in Elasticsearch whose documents have duplicate field values. In the query result I need to remove all duplicates and get only distinct values. For example:
PUT localhost:9200/person
{
"mappings" : {
"person" : {
"properties" : {
"name" : { "type" : "keyword" }
}
}
}
}
POST localhost:9200/person/person
{
"name": "John"
}
{
"name": "John"
}
{
"name": "Marry"
}
{
"name": "Tomas"
}
I'm trying to remove the duplicates with a terms aggregation on the field "name", but it doesn't work.
GET localhost:9200/person/person/_search
{
"size": 3,
"query": {
"function_score": {
"functions": [
{
"random_score": {
"seed": "dasdfdLBpnM0"
}
}
]
}
},
"aggs": {
"top-names": {
"terms": {
"field": "name",
"size": 3
},
"aggs": {
"top_names_hits": {
"top_hits": {
"size": 1
}
}
}
}
}
}
Result:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0.9506482,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "H-5D8GoB8pRyckNSVUeN",
"_score": 0.9506482,
"_source": {
"name": "Tomas"
}
},
{
"_index": "person",
"_type": "person",
"_id": "He5D8GoB8pRyckNSPEfa",
"_score": 0.7700638,
"_source": {
"name": "John"
}
},
{
"_index": "person",
"_type": "person",
"_id": "HO5D8GoB8pRyckNSN0fo",
"_score": 0.71723765,
"_source": {
"name": "John"
}
}
]
},
"aggregations": {
"top-names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John",
"doc_count": 2,
"top_names_hits": {
"hits": {
"total": 2,
"max_score": 0.7700638,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "He5D8GoB8pRyckNSPEfa",
"_score": 0.7700638,
"_source": {
"name": "John"
}
}
]
}
}
},
{
"key": "Marry",
"doc_count": 1,
"top_names_hits": {
"hits": {
"total": 1,
"max_score": 0.66815424,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "Iu5D8GoB8pRyckNScUdv",
"_score": 0.66815424,
"_source": {
"name": "Marry"
}
}
]
}
}
},
{
"key": "Tomas",
"doc_count": 1,
"top_names_hits": {
"hits": {
"total": 1,
"max_score": 0.9506482,
"hits": [
{
"_index": "person",
"_type": "person",
"_id": "H-5D8GoB8pRyckNSVUeN",
"_score": 0.9506482,
"_source": {
"name": "Tomas"
}
}
]
}
}
}
]
}
}
}
The aggregation was also applied to the document with name = "Marry", which is not among the returned hits. I don't understand why, and how can I apply the aggregation only to the query results?
Below is, more or less, the blueprint of an Elasticsearch query:
{
"size": n, // Return the n documents based on "query" section (to frontend)
"query": {
// Here is where you are supposed to mention what documents you want
// Any filter/bool/match query condition
// In your case, you haven't specified any correct condition.
// So basically, it would return all the documents or documents based on size parameter. In your case it returns 3.
},
"aggs":{
// This aggregation query would only be applied on documents
// based on documents filtered/matched by the "query" section.
// In your case it is applying aggregation on all documents of that index as per the comment I've mentioned in the above query section.
}
}
Aggregation Query:
To get what you are looking for, simply use the simplified query below, which keeps your Terms Aggregation with Top Hits as a sub-aggregation.
POST person/_search
{
"size": 0, <------- This is to say, I don't want "query" results to be returned and that I only want below aggregation results.
"aggs": {
"top-names": {
"terms": {
"field": "name",
"size": 10
},
"aggs": {
"top_hits_documents": { <------- Top hits would return the actual documents
"top_hits": {
"size": 1
}
}
}
}
}
}
By specifying "size": 0 at the very top, you apply the aggregation to all the documents without returning any query results.
Only the aggregation results are returned.
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.0,
"hits" : [ ] <------ Notice this. No query results returned
},
"aggregations" : { <------ Aggregation Result starts
"top-names" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John", <------- This is to say there's a value called John
"doc_count" : 2, <------- John occurs in two documents.
"top_hits_documents" : {
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "person",
"_type" : "person",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "John"
}
}
]
}
}
},
{
"key" : "Marry",
"doc_count" : 1,
"top_hits_documents" : {
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "person",
"_type" : "person",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "Marry"
}
}
]
}
}
},
{
"key" : "Thomas",
"doc_count" : 1,
"top_hits_documents" : {
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "person",
"_type" : "person",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"name" : "Thomas"
}
}
]
}
}
}
]
}
}
}
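If the deduplicated documents are needed as a flat list, the buckets can be unpacked on the client. The Python sketch below assumes the response has been parsed into a dict and uses the aggregation names from the query above; the helper name one_document_per_bucket is hypothetical.

```python
def one_document_per_bucket(response, agg_name, top_hits_name):
    """Return the first top hit's _source from every terms bucket."""
    documents = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[top_hits_name]["hits"]["hits"]
        if hits:
            documents.append(hits[0]["_source"])
    return documents


# Trimmed copy of the response shown above.
response = {"aggregations": {"top-names": {"buckets": [
    {"key": "John", "doc_count": 2, "top_hits_documents": {
        "hits": {"hits": [{"_id": "2", "_source": {"name": "John"}}]}}},
    {"key": "Marry", "doc_count": 1, "top_hits_documents": {
        "hits": {"hits": [{"_id": "3", "_source": {"name": "Marry"}}]}}},
    {"key": "Thomas", "doc_count": 1, "top_hits_documents": {
        "hits": {"hits": [{"_id": "4", "_source": {"name": "Thomas"}}]}}},
]}}}

print(one_document_per_bucket(response, "top-names", "top_hits_documents"))
# [{'name': 'John'}, {'name': 'Marry'}, {'name': 'Thomas'}]
```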
Hope that helps!
I have created an Elasticsearch query with function_score and top_hits. This query removes the duplicates and returns the top 1 record for each bucket.
GET employeeid/info/_search
{"size": 0,
"query" : {
"function_score" : {
"query" : {
"match" : {
"employeeID" : "23141A"
}
},
"functions" : [{
"linear" : {
"AcquiredDate" : {
"scale" : "90d",
"decay" : 0.5
}
}
}, {
"filter" : {
"match" : {
"name" : "sorna"
}
},
"boost_factor" : 10
}, {
"filter" : {
"match" : {
"name" : "lingam"
}
},
"boost_factor" : 7
}
],
"boost_mode" : "replace"
}
},
"aggs": {
"duplicateCount": {
"terms": {
"field": "employeehash",
"min_doc_count": 1
},
"aggs": {
"duplicateDocuments": {
"top_hits": {
"size":1
}
}
}
}
}
}
I am getting the expected result, but the problem is that I want to sort the result by _score.
The following is my sample output:
{
"key": "567",
"doc_count": 2,
"duplicateDocuments": {
"hits": {
"total": 2,
"max_score": 0.40220365,
"hits": [
{
"_index": "employeeid",
"_type": "info",
"_id": "5",
"_score": 0.40220365,
"_source": {
"name": "John",
"organisation": "google",
"employeeID": "23141A",
"employeehash": "567",
"AcquiredDate": "2016-02-01T07:57:28Z"
}
}
]
}
}
},
{
"key": "102",
"doc_count": 1,
"duplicateDocuments": {
"hits": {
"total": 1,
"max_score": 2.8154256,
"hits": [
{
"_index": "employeeid",
"_type": "info",
"_id": "8",
"_score": 2.8154256,
"_source": {
"name": "lingam",
"organisation": "google",
"employeeID": "23141A",
"employeehash": "102",
"AcquiredDate": "2016-02-01T07:57:28Z"
}
}
]
}
}
}
Question: How do I sort the buckets by _score in descending order?
I have not enabled Groovy, so I cannot use a script.
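One option that needs no scripting is to sort the returned buckets on the client. The Python sketch below is only an illustration under that assumption: it sorts by each bucket's top_hits max_score, and the helper name sort_buckets_by_score is hypothetical. It only reorders the buckets that the terms aggregation already returned.

```python
def sort_buckets_by_score(buckets, top_hits_name="duplicateDocuments"):
    """Order terms buckets by the max_score of their top_hits, highest first."""
    return sorted(
        buckets,
        key=lambda bucket: bucket[top_hits_name]["hits"]["max_score"],
        reverse=True,
    )


# Trimmed copy of the two buckets shown above.
buckets = [
    {"key": "567", "doc_count": 2,
     "duplicateDocuments": {"hits": {"max_score": 0.40220365}}},
    {"key": "102", "doc_count": 1,
     "duplicateDocuments": {"hits": {"max_score": 2.8154256}}},
]

print([bucket["key"] for bucket in sort_buckets_by_score(buckets)])
# ['102', '567']
```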
{
"took": 53,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 1.0,
"hits": [{
"_index": "db",
"_type": "users",
"_id": "AVOiyjHmzUObmc5euUGS",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIQzUObmc5euUGT",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIlzUObmc5euUGU",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjJKzUObmc5euUGW",
"_score": 1.0,
"_source": {
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiy4jhzUObmc5euUGX",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjI2zUObmc5euUGV",
"_score": 1.0,
"_source": {
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}
}]
}
}
I want to filter the documents based on each user's last visited time, get the most recently accessed document of each individual user, and then group all the filtered documents by offer code.
I get the most recently accessed document of a user by performing a top_hits aggregation. But I am not able to group the results of the top_hits aggregation by offercode.
ES query to get the most recent document of a user
curl -XGET 'localhost:9200/account/users/_search?pretty' -d'{
"size": "0",
"query": {
"bool": {
"must": {
"range": {
"lastvisited": {
"gte": "2016/01/19",
"lte": "2016/01/21"
}
}
}
}
},
"aggs": {
"lastvisited_users": {
"terms": {
"field": "user"
}
,
"aggs": {
"top_user_hits": {
"top_hits": {
"sort": [
{
"lastvisited": {
"order": "desc"
}
}
],
"_source": {
"include": [
"user","offercode","lastvisited"
]
},
"size": 1
}
}
}
}
}}'
ES Output
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"lastvisited_users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "james",
"doc_count" : 3,
"top_user_hits" : {
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIEz1WBU8vnnZ2d",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 03:04:15",
"offercode" : "JB20,JB50",
"user" : "james"
},
"sort" : [ 1453259055000 ]
} ]
}
}
}, {
"key" : "adams",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJMz1WBU8vnnZ2h",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB10",
"user" : "adams"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "adamsnew",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJhz1WBU8vnnZ2i",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB1010,aka10",
"user" : "adamsnew"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "peter",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIoz1WBU8vnnZ2f",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 02:32:22",
"offercode" : "JB20,JB50,JB100",
"user" : "peter"
},
"sort" : [ 1453257142000 ]
} ]
}
}
} ]
}
}
}
Now, I want to aggregate the results of tophits aggregation.
Expected Output
{
"offercode_grouped": {
"JB20": 1,
"JB10": 1,
"JB20,JB50": 1,
"JB20,JB50,JB100": 2,
"":1
}
}
I tried using a pipeline aggregation, but I don't know how to group by the result of the top_hits aggregation.
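For comparison, the grouping can also be done on the client once the top_hits response is available. The Python sketch below assumes the response has been parsed into a dict and that the aggregation names match the query above; the function name count_offercodes is hypothetical. It counts only the users present in the returned buckets, so the terms "size" would have to cover all users.

```python
from collections import Counter


def count_offercodes(response):
    """Count offercode values over each user's most recent document."""
    counts = Counter()
    for bucket in response["aggregations"]["lastvisited_users"]["buckets"]:
        hits = bucket["top_user_hits"]["hits"]["hits"]
        if hits:
            counts[hits[0]["_source"]["offercode"]] += 1
    return dict(counts)


def user_bucket(user, offercode):
    """Build a trimmed bucket in the shape of the ES output shown above."""
    return {"key": user, "top_user_hits": {"hits": {"hits": [
        {"_source": {"user": user, "offercode": offercode}}]}}}


response = {"aggregations": {"lastvisited_users": {"buckets": [
    user_bucket("james", "JB20,JB50"),
    user_bucket("adams", "JB10"),
    user_bucket("adamsnew", "JB1010,aka10"),
    user_bucket("peter", "JB20,JB50,JB100"),
]}}}

print(count_offercodes(response))
# {'JB20,JB50': 1, 'JB10': 1, 'JB1010,aka10': 1, 'JB20,JB50,JB100': 1}
```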
I hope that I understand your problem correctly. I think I found a somewhat hacky "solution".
It is a combination of a function_score query, a sampler aggregation and a terms aggregation.
Create new index
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow" -d'
{
"mappings": {
"document": {
"properties": {
"user": {
"type": "string",
"index": "not_analyzed"
},
"lastvisited": {
"type": "date",
"format": "YYYY/MM/dd HH:mm:ss"
},
"browser": {
"type": "string",
"index": "not_analyzed"
},
"offercode": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
Index documents
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/1?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/2?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/3?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/4?routing=peter" -d'
{
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/5?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/6?routing=adams" -d'
{
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}'
Get aggregations
curl -XPOST "http://127.0.0.1:9200/stackoverflow/_search" -d'
{
"query": {
"function_score": {
"boost_mode": "replace", // we need to replace document score with the result of the functions
"query": {
"bool": {
"filter": [
{
"range": { // get documents within the date range
"lastvisited": {
"gte": "2016/01/19 00:00:00",
"lte": "2016/01/21 23:59:59"
}
}
}
]
}
},
"functions": [
{
"linear": {
"lastvisited": {
"origin": "2016/01/21 23:59:59", // same as lastvisited lte filter
"scale": "2d" // set the scale - please, see elasticsearch docs for more info https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-function-score-query.html#function-decay
}
}
}
]
}
},
"aggs": {
"user": {
"sampler": { // get top scored document per user
"field": "user",
"max_docs_per_value": 1
},
"aggs": {
"offers": { // aggregate user documents per `offercode`
"terms": {
"field": "offercode"
}
}
}
}
},
"size": 0
}'
Response
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"user": {
"doc_count": 3,
"offers": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "JB20,JB50,JB100",
"doc_count": 2
},
{
"key": "JB10",
"doc_count": 1
}
]
}
}
}
}
Unless you have only one shard per index, you need to specify routing when indexing data, because the sampler aggregation is calculated per shard. We need to ensure that all data of a particular user ends up in the same shard in order to get the one document with the highest score per user.
The sampler aggregation returns documents by score, which is why we need to modify the score of the documents. This is where the function_score query can help. With field_value_factor, the score is just the timestamp of the last visit, so the more recent the visit, the higher the score.
UPDATE: With field_value_factor there is probably a problem with _score accuracy. For more info see issue https://github.com/elastic/elasticsearch/issues/11872. That is why a decay function is used, as clintongormley suggested in the issue. A decay function works on both sides of the origin, which means that documents 1 day older and 1 day younger than the origin receive the same _score. That's why we need to filter out newer documents (see the range filter in the query).
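The symmetry of the decay function can be checked with a small Python sketch. It assumes the linear decay formula as given in the Elasticsearch function_score documentation and works in plain day units rather than real dates; the function name linear_decay is made up for illustration.

```python
def linear_decay(value, origin, scale, decay=0.5, offset=0.0):
    """Linear decay score, assuming the formula from the function_score docs."""
    s = scale / (1.0 - decay)  # distance at which the score reaches 0
    distance = max(0.0, abs(value - origin) - offset)
    return max(0.0, (s - distance) / s)


# Origin at day 0 with scale "2d", as in the query above (decay defaults to 0.5).
print(linear_decay(-1, origin=0, scale=2))  # 0.75, one day older
print(linear_decay(1, origin=0, scale=2))   # 0.75, one day newer
print(linear_decay(-2, origin=0, scale=2))  # 0.5, exactly `scale` away
```

Only the absolute distance from the origin enters the score, so the range filter is what removes the documents on the newer side.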
NOTE: I tried this query only with the data you can see in the example, so a bigger dataset is needed to test it. But I think it should work...
Check this solution: it's more limited, but it is suitable for production: https://stackoverflow.com/a/39788948/4769188
This may solve your problem:
SELECT offercode, count(offercode)
FROM users AS u1
WHERE u1.ID = (SELECT u2.ID FROM users AS u2 WHERE u2.user = u1.user ORDER BY u2.lastvisited DESC LIMIT 1)
AND u1.lastvisited >= '2016/01/20'
GROUP BY offercode;