Elasticsearch highlighting not highlighting

I'm having difficulty understanding how to get highlighting to work.
My queries are returning the item, but I do not see the tags that would cause the highlight.
Here's the set up for the test index:
curl -XPUT 'http://localhost:9200/testfoo' -d '{
"mappings": {
"entry": {
"properties": {
"id": { "type": "integer" },
"owner": { "type": "string" },
"target": {
"properties": {
"id": { "type": "integer" },
"type": {
"type": "string",
"index": "not_analyzed"
}
}
},
"body": { "type": "string" },
"body_plain": { "type": "string"}
}
}
}
}'
Here's a couple of inserted documents:
curl -XPUT 'http://localhost:9200/testfoo/entry/1' -d'{
"id": 1,
"owner": "me",
"target": {
"type": "event",
"id": 100
},
"body": "<div>Message One has foobar in it</div>",
"body_plain": "Message One has foobar in it"
}'
curl -XPUT 'http://localhost:9200/testfoo/entry/2' -d'{
"id": 2,
"owner": "me",
"target": {
"type": "event",
"id": 200
},
"body": "<div>Message One has no bar in it</div>",
"body_plain": "Message One has no bar in it"
}'
A simple search returns the expected document:
curl -XPOST 'http://localhost:9200/testfoo/_search?pretty' -d '{
"query": {
"simple_query_string": {
"query": "foobar"
}
}
}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.09492774,
"hits" : [ {
"_index" : "testfoo",
"_type" : "entry",
"_id" : "1",
"_score" : 0.09492774,
"_source" : {
"id" : 1,
"owner" : "me",
"target" : {
"type" : "event",
"id" : 100
},
"body" : "<div>Message One has foobar in it</div>",
"body_plain" : "Message One has foobar in it"
}
} ]
}
}
However, when I add "highlighting" I get the same JSON but body_plain is not "highlighted" with the matching term:
curl -XPOST 'http://localhost:9200/testfoo/_search?pretty' -d '{
"query":{
"query": {
"simple_query_string":{
"query":"foobar"
}
}
},
"highlight": {
"pre_tags": [ "<div class=\"highlight\">" ],
"post_tags": [ "</div>" ],
"fields": {
"_all": {
"fragment_size": 10,
"number_of_fragments": 1
}
}
},
"sort": [
"_score"
],
"_source": [ "target", "id", "body_plain", "body" ],
"min_score": 0.9,
"size":10
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "testfoo",
"_type" : "entry",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"body" : "<div>Message One has foobar in it</div>",
"target" : {
"id" : 100,
"type" : "event"
},
"body_plain" : "Message One has foobar in it"
}
} ]
}
}
Where I was expecting body_plain to look like
Message One has <div class="highlight">foobar</div> in it
Wondering what I'm doing wrong. Thanks.

From the official documentation
In order to perform highlighting, the actual content of the field is
required. If the field in question is stored (has store set to true in
the mapping) it will be used, otherwise, the actual _source will be
loaded and the relevant field will be extracted from it.
The _all field cannot be extracted from _source, so it can only be
used for highlighting if it mapped to have store set to true.
You have two ways to solve this. Either you change your mapping to store the _all field:
{
"mappings": {
"entry": {
"_all": { <-- add this
"store": true
},
"properties": {
...
Or you change your query to this:
curl -XPOST 'http://localhost:9200/testfoo/_search?pretty' -d '{
"query":{
"query": {
"simple_query_string":{
"query":"foobar"
}
}
},
"highlight": {
"pre_tags": [ "<div class=\"highlight\">" ],
"post_tags": [ "</div>" ],
"require_field_match": false, <-- add this
"fields": {
"*": { <-- use this
"fragment_size": 10,
"number_of_fragments": 1
}
}
},
"sort": [
"_score"
],
"_source": [ "target", "id", "body_plain", "body" ],
"min_score": 0.9,
"size":10
}'
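The second option can also be applied to a request body in code. The following is a minimal Python sketch, assuming the search body is held as a plain dict; the helper name `highlight_all_fields` is hypothetical and not part of any Elasticsearch client.

```python
import copy


def highlight_all_fields(body):
    """Return a copy of a search body whose highlight section targets "*"
    instead of "_all" and no longer requires a field match.

    Hypothetical helper for illustration only.
    """
    fixed = copy.deepcopy(body)
    highlight = fixed.setdefault("highlight", {})
    fields = highlight.setdefault("fields", {})
    # Move any options that were set on "_all" over to the wildcard entry.
    options = fields.pop("_all", {})
    fields.setdefault("*", options)
    highlight["require_field_match"] = False
    return fixed


original = {
    "query": {"simple_query_string": {"query": "foobar"}},
    "highlight": {
        "pre_tags": ["<div class=\"highlight\">"],
        "post_tags": ["</div>"],
        "fields": {"_all": {"fragment_size": 10, "number_of_fragments": 1}},
    },
}

fixed = highlight_all_fields(original)
print(fixed["highlight"]["fields"])
```

The original dict is left untouched, so the two bodies can be compared side by side before sending either one.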

Related

Elasticsearch: collapsed results participating in sorting

I am using Elasticsearch and I want to group our results by a specific field, returning only the most recent document per group. When scoring and sorting, I want the documents I am not returning (the ones that are older) to be ignored.
I have tried approaching this with collapse, however the "hidden" documents are also taken into account, which I would like to avoid.
Example
In the following example I have 2 groups of documents, which I would like to group by their email, taking for each group the most recent by created_at, and sort them by their rating descending.
With the example data, the most recent documents are Aaa 1 (with email aaa#aaa.com) and Bbb 4 (with email bbb#bbb.com). Sorting by rating descending, I expect Bbb 4 and then Aaa 1. However, they are returned the other way around, because Aaa 2 and Aaa 3 are also taken into account when sorting, which I want to avoid.
How can I write my query in a way that would return Bbb 4 and then Aaa 1? Should I be using the top_hits aggregation instead?
PUT test
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"email": {
"type": "keyword"
},
"description": {
"type": "text"
},
"rating": {
"type": "integer"
},
"created_at": {
"type": "date"
}
}
}
}
POST test/_doc
{
"name": "Aaa 1",
"rating": 1,
"created_at": "2021-01-01",
"description": "A quick fox",
"email": "aaa#aaa.com"
}
POST test/_doc
{
"name": "Aaa 2",
"rating": 20,
"created_at": "2020-01-01",
"description": "jumps over",
"email": "aaa#aaa.com"
}
POST test/_doc
{
"name": "Aaa 3",
"rating": 30,
"created_at": "2019-01-01",
"description": "the fence",
"email": "aaa#aaa.com"
}
POST test/_doc
{
"name": "Bbb 4",
"rating": 4,
"created_at": "2021-01-02",
"description": "behind the house",
"email": "bbb#bbb.com"
}
POST test/_doc
{
"name": "Bbb 5",
"rating": 5,
"created_at": "2020-01-02",
"description": "we live in",
"email": "bbb#bbb.com"
}
GET test/_search
{
"_source": false,
"track_total_hits": false,
"query": {
"bool": {
"should": {
"match_all": {}
}
}
},
"collapse": {
"field": "email",
"inner_hits": [
{
"name": "last_document",
"size": 1,
"_source": ["name","email","rating"],
"sort": [
{
"created_at": {
"order": "desc"
}
}
]
}
]
},
"sort": [
{
"rating": {
"order": "desc"
}
}
]
}
This returns
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"max_score" : null,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "bccEn3oBRQ1dOOnBe3nD",
"_score" : null,
"fields" : {
"email" : [
"aaa#aaa.com"
]
},
"sort" : [
30
],
"inner_hits" : {
"last_document" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "a8cEn3oBRQ1dOOnBdXli",
"_score" : null,
"_source" : {
"name" : "Aaa 1",
"rating" : 1,
"email" : "aaa#aaa.com"
},
"sort" : [
1609459200000
]
}
]
}
}
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "b8cEn3oBRQ1dOOnBiHkx",
"_score" : null,
"fields" : {
"email" : [
"bbb#bbb.com"
]
},
"sort" : [
5
],
"inner_hits" : {
"last_document" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "bscEn3oBRQ1dOOnBgHlt",
"_score" : null,
"_source" : {
"name" : "Bbb 4",
"rating" : 4,
"email" : "bbb#bbb.com"
},
"sort" : [
1609545600000
]
}
]
}
}
}
}
]
}
}
I have run into the same problem. As far as I know, this is not possible.
As a workaround you can do this:
GET test/_search
{
"_source": false,
"track_total_hits": false,
"query": {
"match_all": {}
},
"collapse": {
"field": "email"
},
"sort": [
{
"created_at": {
"order": "desc"
}
}
]
}
This would return the latest document per email in your 'normal' hits array. You would then need to sort those hits by rating after the search.
The problem I have is that my result set is too large to fetch at once and re-sort them after the search. If you found a different solution to this, I would be happy to hear it :)
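For a result set small enough to hold in memory, the two steps of the workaround can be sketched in Python as follows. This is only an illustration on the example data: the first step simulates what the collapsed, date-sorted search returns, the second step is the client-side re-sort. The function name is hypothetical, and the sketch assumes each hit carries its rating (the query above sets "_source": false, so the rating would have to be requested explicitly).

```python
docs = [
    {"name": "Aaa 1", "rating": 1, "created_at": "2021-01-01", "email": "aaa#aaa.com"},
    {"name": "Aaa 2", "rating": 20, "created_at": "2020-01-01", "email": "aaa#aaa.com"},
    {"name": "Aaa 3", "rating": 30, "created_at": "2019-01-01", "email": "aaa#aaa.com"},
    {"name": "Bbb 4", "rating": 4, "created_at": "2021-01-02", "email": "bbb#bbb.com"},
    {"name": "Bbb 5", "rating": 5, "created_at": "2020-01-02", "email": "bbb#bbb.com"},
]


def latest_per_email_by_rating(hits):
    """Keep the most recent document per email, then sort by rating descending."""
    latest = {}
    for hit in hits:
        current = latest.get(hit["email"])
        # ISO dates (YYYY-MM-DD) compare chronologically as plain strings.
        if current is None or hit["created_at"] > current["created_at"]:
            latest[hit["email"]] = hit
    # Client-side re-sort: older documents no longer influence the order.
    return sorted(latest.values(), key=lambda h: h["rating"], reverse=True)


result = latest_per_email_by_rating(docs)
print([doc["name"] for doc in result])  # ['Bbb 4', 'Aaa 1']
```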

Finding all objects with a certain field in ElasticSearch

My mapping looks like so:
"condition": {
"properties": {
"name": {
"type": "keyword"
},
"value": {
"type": "keyword"
}
}
},
and some data I have looks like:
"condition": [
{
"name": "condition",
"value": "new"
},
{
"name": "condition",
"value": "gently-used"
}
]
How can I write a query that finds all objects within the array that have a new condition?
I have the following but I am getting 0 results back:
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"attribute_condition": "new"
}
}
]
}
}
}
First, you need to map your condition field as a nested type.
"condition": {
"type": "nested",
"properties": {
"name": { "type": "keyword" },
"value": { "type": "keyword" }
}
},
Now you're able to query each element of the condition array independently of the others. Next, you need to use the nested query and request the inner hits, which are returned in the inner_hits object of the query response:
{
"query": {
"bool": {
"must": {
"nested": {
"path": "condition",
"query": {
"match": {
"condition.value": "new"
}
},
"inner_hits": {}
}
}
}
}
}
An example response will look like below:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.6931471,
"hits" : [
{
"_index" : "nested",
"_type" : "_doc",
"_id" : "Xx_LN3gBp5RUqdfAef3B",
"_score" : 0.6931471,
"_source" : {
"condition" : [
{
"name" : "condition",
"value" : "new"
},
{
"name" : "condition",
"value" : "gently-used"
}
]
},
"inner_hits" : { <--- here begins the list of inner hits
"condition" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.6931471,
"hits" : [
{
"_index" : "nested",
"_type" : "_doc",
"_id" : "Xx_LN3gBp5RUqdfAef3B",
"_nested" : {
"field" : "condition",
"offset" : 0
},
"_score" : 0.6931471,
"_source" : {
"name" : "condition",
"value" : "new"
}
}
]
}
}
}
}
]
}
}
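To pull just the matching array elements out of such a response, something like the following Python sketch could be used. It assumes the response has already been parsed into a dict shaped like the example above (trimmed here to the relevant keys); the function name is hypothetical.

```python
def matched_nested_elements(response, inner_hits_name="condition"):
    """Collect the _source of every nested element that matched the query."""
    matched = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(inner_hits_name, {})
        for inner_hit in inner.get("hits", {}).get("hits", []):
            matched.append(inner_hit["_source"])
    return matched


# Trimmed copy of the example response shown above.
response = {
    "hits": {
        "hits": [
            {
                "_id": "Xx_LN3gBp5RUqdfAef3B",
                "_source": {
                    "condition": [
                        {"name": "condition", "value": "new"},
                        {"name": "condition", "value": "gently-used"},
                    ]
                },
                "inner_hits": {
                    "condition": {
                        "hits": {
                            "hits": [
                                {
                                    "_nested": {"field": "condition", "offset": 0},
                                    "_source": {"name": "condition", "value": "new"},
                                }
                            ]
                        }
                    }
                },
            }
        ]
    }
}

elements = matched_nested_elements(response)
print(elements)  # [{'name': 'condition', 'value': 'new'}]
```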

Elasticsearch sort settings on index giving strange results

I have an index set up like so:
PUT items
{
"settings": {
"index": {
"sort.field": ["popularity", "title_keyword"],
"sort.order": ["desc", "asc"]
},
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 15,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
},
"title_keyword": {
"type": "keyword"
},
"popularity": {
"type": "integer"
},
"visibility": {
"type": "keyword"
}
}
}
}
With the following data:
POST items/_doc/1
{
"title": "The Arbor",
"popularity": 5,
"title_keyword": "The Arbor",
"visibility": "public"
}
POST items/_doc/2
{
"title": "The Canon",
"popularity": 10,
"title_keyword": "The Canon",
"visibility": "public"
}
POST items/_doc/3
{
"title": "The Brew",
"popularity": 15,
"title_keyword": "The Brew",
"visibility": "public"
}
I run this query on the data:
GET items/_search
{
"size": 3,
"query": {
"bool": {
"must": [
{
"match": {
"title": {
"query": "the",
"operator": "and"
}
}
},
{
"match": {
"visibility": "public"
}
}
]
}
},
"highlight": {
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"],
"fields": {
"title": {}
}
}
}
It seems to match the records correctly on the word the but the sorting does not seem to work. I would expect it to be sorted by popularity as defined and the results would be The Arbor, The Brew, The Canon in that order but the results I get are as follows:
{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.27381438,
"hits" : [
{
"_index" : "items",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.27381438,
"_source" : {
"title" : "The Brew",
"popularity" : 15,
"title_keyword" : "The Brew",
"visibility" : "public"
},
"highlight" : {
"title" : [
"<mark>The</mark> Brew"
]
}
},
{
"_index" : "items",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.26392496,
"_source" : {
"title" : "The Arbor",
"popularity" : 5,
"title_keyword" : "The Arbor",
"visibility" : "public"
},
"highlight" : {
"title" : [
"<mark>The</mark> Arbor"
]
}
},
{
"_index" : "items",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.26392496,
"_source" : {
"title" : "The Canon",
"popularity" : 10,
"title_keyword" : "The Canon",
"visibility" : "public"
},
"highlight" : {
"title" : [
"<mark>The</mark> Canon"
]
}
}
]
}
}
Does defining the sort fields and orders when creating the index, under the settings, automatically sort the results? It seems to be sorting by score and not the popularity. If I include the sort options in the query it gives me the correct results back:
GET items/_search
{
"size": 3,
"sort": [
{
"popularity": {
"order": "desc"
}
},
{
"title_keyword": {
"order": "asc"
}
}
],
"query": {
"bool": {
"must": [
{
"match": {
"title": {
"query": "the",
"operator": "and"
}
}
},
{
"match": {
"visibility": "public"
}
}
]
}
},
"highlight": {
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"],
"fields": {
"title": {}
}
}
}
I read that including the sort in the query like this is inefficient and to include it in the settings. Am I not doing something when creating the index to make it sort by popularity by default? Does including the sort options in the query result in inefficient queries? Or do I actually need to include it in every query?
Hopefully this makes sense! Thanks
Index sorting defines how the documents inside each segment of a shard are ordered; it is not related to the sorting of search results. A sorted index is useful if you often run searches that sort by the same criteria, because the index sort can then speed up those searches.
If your search has a different sort than the index or no sort at all, the index sort is not relevant.
Please see the documentation for index sorting and especially the part that explains how index sorting is used.
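In other words, the sort still has to be part of every request. A minimal Python sketch of deriving the request's sort clause from the index sort settings, so the two cannot drift apart, might look like this. The helper name is hypothetical; setting track_total_hits to false reflects the condition the index sorting documentation gives for early termination, and is an assumption about what you want here.

```python
def sort_clause_from_index_settings(index_settings):
    """Build a search "sort" clause mirroring the index sort settings."""
    fields = index_settings["sort.field"]
    orders = index_settings["sort.order"]
    return [{field: {"order": order}} for field, order in zip(fields, orders)]


# Same values as in the index definition from the question.
index_settings = {
    "sort.field": ["popularity", "title_keyword"],
    "sort.order": ["desc", "asc"],
}

body = {
    "size": 3,
    # Assumption: exact hit counts are not needed, which lets a search
    # sorted like the index stop collecting early.
    "track_total_hits": False,
    "sort": sort_clause_from_index_settings(index_settings),
    "query": {"match": {"title": {"query": "the", "operator": "and"}}},
}

print(body["sort"])
```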

How to filter by the size of an array in nested type?

Let's say I have the following type:
{
"2019-11-04": {
"mappings": {
"_doc": {
"properties": {
"labels": {
"type": "nested",
"properties": {
"confidence": {
"type": "float"
},
"created_at": {
"type": "date",
"format": "strict_date_optional_time||date_time||epoch_millis"
},
"label": {
"type": "keyword"
},
"updated_at": {
"type": "date",
"format": "strict_date_optional_time||date_time||epoch_millis"
},
"value": {
"type": "keyword",
"fields": {
"numeric": {
"type": "float",
"ignore_malformed": true
}
}
}
}
},
"params": {
"type": "object"
},
"type": {
"type": "keyword"
}
}
}
}
}
}
And I want to filter by the size/length of the labels array. I've tried the following (as the official docs suggest):
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "doc['labels'].size > 10"
}
}
}
}
}
}
but I keep getting:
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
"doc['labels'].size > 10",
" ^---- HERE"
],
"script": "doc['labels'].size > 10",
"lang": "painless"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "2019-11-04",
"node": "kk5MNRPoR4SYeQpLk2By3A",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:81)",
"org.elasticsearch.search.lookup.LeafDocLookup.get(LeafDocLookup.java:39)",
"doc['labels'].size > 10",
" ^---- HERE"
],
"script": "doc['labels'].size > 10",
"lang": "painless",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "No field found for [labels] in mapping with types []"
}
}
}
]
},
"status": 500
}
I'm afraid that is not possible, because labels is not a field that Elasticsearch stores or builds an inverted index on.
doc['fieldname'] only works on fields for which an inverted index is created, and Elasticsearch's Query DSL likewise only works on such fields. Unfortunately, a nested type is not itself a field on which an inverted index is created.
Having said so, I have the below two ways of doing this.
For the sake of simplicity, I've created sample mapping, documents and two possible solutions which may help you.
Mapping:
PUT my_sample_index
{
"mappings": {
"properties": {
"myfield": {
"type": "nested",
"properties": {
"label": {
"type": "keyword"
}
}
}
}
}
}
Sample Documents:
// single field inside 'myfield'
POST my_sample_index/_doc/1
{
"myfield": {
"label": ["New York", "LA", "Austin"]
}
}
// two fields inside 'myfield'
POST my_sample_index/_doc/2
{
"myfield": {
"label": ["London", "Leicester", "Newcastle", "Liverpool"],
"country": "England"
}
}
Solution 1: Using Script Fields (Managing at Application Level)
This workaround does not filter inside Elasticsearch, but it returns a flag that you can filter on in your service layer or application.
POST my_sample_index/_search
{
"_source": "*",
"query": {
"bool": {
"must": [
{
"match_all": {}
}
]
}
},
"script_fields": {
"label_size": {
"script": {
"lang": "painless",
"source": "params['_source']['myfield'].size() > 1"
}
}
}
}
Notice that the response contains a separate field, label_size, with a true or false value.
A sample response is something like below:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_sample_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"myfield" : {
"label" : [
"New York",
"LA",
"Austin"
]
}
},
"fields" : {
"label_size" : [ <---- Scripted Field
false
]
}
},
{
"_index" : "my_sample_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"myfield" : {
"country" : "England",
"label" : [
"London",
"Leicester",
"Newcastle",
"Liverpool"
]
}
},
"fields" : { <---- Scripted Field
"label_size" : [
true <---- True because it has two fields 'label' and 'country'
]
}
}
]
}
}
Note that only the second document returns true, as it has two fields, country and label. However, if you only want the documents whose label_size is true, that would have to be handled in your application layer.
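The application-layer step could look like the Python sketch below. It assumes the response has been parsed into a dict shaped like the sample response above (trimmed here to the relevant keys); the function name is hypothetical.

```python
def hits_with_true_flag(response, field="label_size"):
    """Keep only the hits whose scripted field evaluated to true."""
    kept = []
    for hit in response["hits"]["hits"]:
        # Script fields come back as single-element lists under "fields".
        flags = hit.get("fields", {}).get(field, [])
        if flags and flags[0] is True:
            kept.append(hit)
    return kept


response = {
    "hits": {
        "hits": [
            {"_id": "1", "fields": {"label_size": [False]}},
            {"_id": "2", "fields": {"label_size": [True]}},
        ]
    }
}

kept = hits_with_true_flag(response)
print([hit["_id"] for hit in kept])  # ['2']
```

Because the filtering happens after the search, paging and total counts still refer to the unfiltered result set.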
Solution 2: Reindexing with labels.size using Script Processor
Create a new index as below:
PUT my_sample_index_temp
{
"mappings": {
"properties": {
"myfield": {
"type": "nested",
"properties": {
"label": {
"type": "keyword"
}
}
},
"labels_size":{ <---- New Field where we'd store the size
"type": "integer"
}
}
}
}
Create the below pipeline:
PUT _ingest/pipeline/set_labels_size
{
"description": "sets the value of labels size",
"processors": [
{
"script": {
"source": """
ctx.labels_size = ctx.myfield.size();
"""
}
}
]
}
Use Reindex API to reindex from my_sample_index index
POST _reindex
{
"source": {
"index": "my_sample_index"
},
"dest": {
"index": "my_sample_index_temp",
"pipeline": "set_labels_size"
}
}
Verify the documents in my_sample_index_temp using GET my_sample_index_temp/_search
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "my_sample_index_temp",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"labels_size" : 1, <---- New Field Created
"myfield" : {
"label" : [
"New York",
"LA",
"Austin"
]
}
}
},
{
"_index" : "my_sample_index_temp",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"labels_size" : 2, <----- New Field Created
"myfield" : {
"country" : "England",
"label" : [
"London",
"Leicester",
"Newcastle",
"Liverpool"
]
}
}
}
]
}
}
Now you can simply use the labels_size field in your query, which is much easier and also more efficient.
Hope this helps!
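Once labels_size exists as an integer field, the original "more than 10" filter becomes an ordinary range query. A minimal Python sketch that builds such a request body, assuming the field name from the reindexed mapping above (the helper name is hypothetical):

```python
import json


def min_size_query(field, minimum):
    """Build a filter-only search body matching documents where field > minimum."""
    return {"query": {"bool": {"filter": {"range": {field: {"gt": minimum}}}}}}


body = min_size_query("labels_size", 10)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request against my_sample_index_temp.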
You can solve it with a custom score approach:
GET 2019-11-04/_search
{
"min_score": 0.1,
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"script_score": {
"script": {
"source": "params['_source']['labels'].length > 10 ? 1 : 0"
}
}
}
]
}
}
}

Group by result of top hits aggregation

{
"took": 53,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 1.0,
"hits": [{
"_index": "db",
"_type": "users",
"_id": "AVOiyjHmzUObmc5euUGS",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIQzUObmc5euUGT",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjIlzUObmc5euUGU",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjJKzUObmc5euUGW",
"_score": 1.0,
"_source": {
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiy4jhzUObmc5euUGX",
"_score": 1.0,
"_source": {
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}
}, {
"_index": "db",
"_type": "users",
"_id": "AVOiyjI2zUObmc5euUGV",
"_score": 1.0,
"_source": {
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}
}]
}
}
I want to filter the documents by each user's last visited time, keep only the most recently accessed document of each user, and then group those filtered documents by offer code.
I can get the most recently accessed document of a user with a top_hits aggregation, but I am not able to group the results of the top_hits aggregation by offercode.
ES Query to get most recent document of a user
curl -XGET localhost:9200/account/users/_search?pretty -d'{
"size": "0",
"query": {
"bool": {
"must": {
"range": {
"lastvisited": {
"gte": "2016/01/19",
"lte": "2016/01/21"
}
}
}
}
},
"aggs": {
"lastvisited_users": {
"terms": {
"field": "user"
}
,
"aggs": {
"top_user_hits": {
"top_hits": {
"sort": [
{
"lastvisited": {
"order": "desc"
}
}
],
"_source": {
"include": [
"user","offercode","lastvisited"
]
},
"size": 1
}
}
}
}
}}'
ES Output
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 6,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"lastvisited_users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "james",
"doc_count" : 3,
"top_user_hits" : {
"hits" : {
"total" : 3,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIEz1WBU8vnnZ2d",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 03:04:15",
"offercode" : "JB20,JB50",
"user" : "james"
},
"sort" : [ 1453259055000 ]
} ]
}
}
}, {
"key" : "adams",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJMz1WBU8vnnZ2h",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB10",
"user" : "adams"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "adamsnew",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexJhz1WBU8vnnZ2i",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 00:12:11",
"offercode" : "JB1010,aka10",
"user" : "adamsnew"
},
"sort" : [ 1453248731000 ]
} ]
}
}
}, {
"key" : "peter",
"doc_count" : 1,
"top_user_hits" : {
"hits" : {
"total" : 1,
"max_score" : null,
"hits" : [ {
"_index" : "accounts",
"_type" : "users",
"_id" : "AVOtexIoz1WBU8vnnZ2f",
"_score" : null,
"_source" : {
"lastvisited" : "2016/01/20 02:32:22",
"offercode" : "JB20,JB50,JB100",
"user" : "peter"
},
"sort" : [ 1453257142000 ]
} ]
}
}
} ]
}
}
}
Now, I want to aggregate the results of tophits aggregation.
Expected Output
{
"offercode_grouped": {
"JB20": 1,
"JB10": 1,
"JB20,JB50": 1,
"JB20,JB50,JB100": 2,
"":1
}
}
I tried using Pipeline aggregation but I don't know how to groupby the result of tophits aggregation.
I hope that I understand your problem correctly. I think I found a somewhat hacky "solution".
It is a combination of function_score query, sampler aggregation and terms aggregation.
Create new index
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow" -d'
{
"mappings": {
"document": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"lastvisited": {
"type": "date",
"format": "YYYY/MM/dd HH:mm:ss"
},
"browser": {
"type": "string",
"index": "not_analyzed"
},
"offercode": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
Index documents
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/1?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 02:03:11",
"browser": "chrome",
"offercode": "JB20"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/2?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/20 03:04:15",
"browser": "firefox",
"offercode": "JB20,JB50"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/3?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/21 00:15:21",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/4?routing=peter" -d'
{
"user": "peter",
"lastvisited": "2016/01/20 02:32:22",
"browser": "chrome",
"offercode": "JB20,JB50,JB100"
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/5?routing=james" -d'
{
"user": "james",
"lastvisited": "2016/01/19 02:03:11",
"browser": "chrome",
"offercode": ""
}'
curl -s -XPUT "http://127.0.0.1:9200/stackoverflow/document/6?routing=adams" -d'
{
"user": "adams",
"lastvisited": "2016/01/20 00:12:11",
"browser": "chrome",
"offercode": "JB10"
}'
Get aggregations
curl -XPOST "http://127.0.0.1:9200/stackoverflow/_search" -d'
{
"query": {
"function_score": {
"boost_mode": "replace", // we need to replace document score with the result of the functions
"query": {
"bool": {
"filter": [
{
"range": { // get documents within the date range
"lastvisited": {
"gte": "2016/01/19 00:00:00",
"lte": "2016/01/21 23:59:59"
}
}
}
]
}
},
"functions": [
{
"linear": {
"lastvisited": {
"origin": "2016/01/21 23:59:59", // same as lastvisited lte filter
"scale": "2d" // set the scale - please, see elasticsearch docs for more info https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-function-score-query.html#function-decay
}
}
}
]
}
},
"aggs": {
"user": {
"sampler": { // get top scored document per user
"field": "user",
"max_docs_per_value": 1
},
"aggs": {
"offers": { // aggregate user documents per `offercode`
"terms": {
"field": "offercode"
}
}
}
}
},
"size": 0
}'
Response
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 0,
"hits": []
},
"aggregations": {
"user": {
"doc_count": 3,
"offers": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "JB20,JB50,JB100",
"doc_count": 2
},
{
"key": "JB10",
"doc_count": 1
}
]
}
}
}
}
Unless you have only one shard per index, you need to specify routing when indexing data, because the sampler aggregation is calculated per shard. We need to ensure that all documents of a particular user end up in the same shard, so that we get the single highest-scoring document per user.
The sampler aggregation returns documents by score, which is why we need to modify the score of the documents. This is where the function_score query helps. Using field_value_factor, the score is just the timestamp of the last visit, so the more recent the visit, the higher the score.
UPDATE: With field_value_factor there is probably a problem with _score accuracy. For more info see issue https://github.com/elastic/elasticsearch/issues/11872. That is why a decay function is used instead, as clintongormley suggested in the issue. A decay function works on both sides of the origin, which means that documents 1 day older and 1 day newer than the origin receive the same _score. That is why we need to filter out newer documents (see the range filter in the query).
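The symmetry of the decay function can be checked with a small Python sketch of the linear decay formula as given in the function_score documentation, with distances expressed in days. The default decay of 0.5 and offset of 0 are assumed, and the function name is only illustrative.

```python
def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Linear decay score: 1.0 at the origin, `decay` at distance `scale`,
    and 0.0 once the distance reaches scale / (1 - decay)."""
    s = scale / (1.0 - decay)
    distance = max(0.0, abs(value - origin) - offset)
    return max((s - distance) / s, 0.0)


origin = 0.0  # the origin date, taken as day 0
scale = 2.0   # "2d" in the query above

older = linear_decay(-1.0, origin, scale)  # one day before the origin
newer = linear_decay(+1.0, origin, scale)  # one day after the origin
print(older, newer)  # 0.75 0.75
```

Both documents score 0.75, so without the range filter a document newer than the origin could outrank an older one that is closer to it.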
NOTE: I tried this query just with the data which you can see in the example, so bigger dataset is needed to test the query. But I think it should work...
Check this solution: it's more limited, but it is suitable for production: https://stackoverflow.com/a/39788948/4769188
This may solve your problem:
SELECT offercode, COUNT(offercode)
FROM users AS u1
WHERE u1.ID = (SELECT u2.ID FROM users AS u2 WHERE u2.user = u1.user ORDER BY u2.lastvisited DESC LIMIT 1)
AND u1.lastvisited >= '2016/01/20'
GROUP BY offercode;
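The intent of that statement (take the latest row per user, then count rows per offercode) can be sketched client-side in Python on the six documents from the question. This is an illustration of the logic only, not an Elasticsearch query; the function name and the `since` parameter are hypothetical.

```python
from collections import Counter

visits = [
    {"user": "james", "lastvisited": "2016/01/20 02:03:11", "offercode": "JB20"},
    {"user": "james", "lastvisited": "2016/01/20 03:04:15", "offercode": "JB20,JB50"},
    {"user": "james", "lastvisited": "2016/01/21 00:15:21", "offercode": "JB20,JB50,JB100"},
    {"user": "peter", "lastvisited": "2016/01/20 02:32:22", "offercode": "JB20,JB50,JB100"},
    {"user": "james", "lastvisited": "2016/01/19 02:03:11", "offercode": ""},
    {"user": "adams", "lastvisited": "2016/01/20 00:12:11", "offercode": "JB10"},
]


def offercode_counts(rows, since="2016/01/20"):
    """Count offercodes over each user's most recent visit on or after `since`."""
    latest = {}
    for row in rows:
        current = latest.get(row["user"])
        # "YYYY/MM/DD HH:MM:SS" strings compare chronologically.
        if current is None or row["lastvisited"] > current["lastvisited"]:
            latest[row["user"]] = row
    return Counter(
        row["offercode"] for row in latest.values() if row["lastvisited"] >= since
    )


counts = offercode_counts(visits)
print(dict(counts))  # {'JB20,JB50,JB100': 2, 'JB10': 1}
```

The counts agree with the buckets in the aggregation response shown in the earlier answer.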
