I am using Elasticsearch and I want to group our results by a specific field, returning only the most recent document per group. When scoring and sorting, I want the documents I am not returning (the ones that are older) to be ignored.
I have tried approaching this with collapse; however, the "hidden" documents are also taken into account, which I would like to avoid.
Example
In the following example I have 2 groups of documents, which I would like to group by their email, taking for each group the most recent by created_at, and sort them by their rating descending.
With the example data, the most recent documents are Aaa 1 (email aaa#aaa.com) and Bbb 4 (email bbb#bbb.com). Sorting by rating descending, I expect Bbb 4 and then Aaa 1. However, they are returned the other way around, because Aaa 2 and Aaa 3 are also taken into account when sorting, which I want to avoid.
How can I write my query in a way that would return Bbb 4 and then Aaa 1? Should I be using the top_hits aggregation instead?
PUT test
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"email": {
"type": "keyword"
},
"description": {
"type": "text"
},
"rating": {
"type": "integer"
},
"created_at": {
"type": "date"
}
}
}
}
POST test/_doc
{
"name": "Aaa 1",
"rating": 1,
"created_at": "2021-01-01",
"description": "A quick fox",
"email": "aaa#aaa.com"
}
POST test/_doc
{
"name": "Aaa 2",
"rating": 20,
"created_at": "2020-01-01",
"description": "jumps over",
"email": "aaa#aaa.com"
}
POST test/_doc
{
"name": "Aaa 3",
"rating": 30,
"created_at": "2019-01-01",
"description": "the fence",
"email": "aaa#aaa.com"
}
POST test/_doc
{
"name": "Bbb 4",
"rating": 4,
"created_at": "2021-01-02",
"description": "behind the house",
"email": "bbb#bbb.com"
}
POST test/_doc
{
"name": "Bbb 5",
"rating": 5,
"created_at": "2020-01-02",
"description": "we live in",
"email": "bbb#bbb.com"
}
GET test/_search
{
"_source": false,
"track_total_hits": false,
"query": {
"bool": {
"should": {
"match_all": {}
}
}
},
"collapse": {
"field": "email",
"inner_hits": [
{
"name": "last_document",
"size": 1,
"_source": ["name","email","rating"],
"sort": [
{
"created_at": {
"order": "desc"
}
}
]
}
]
},
"sort": [
{
"rating": {
"order": "desc"
}
}
]
}
This returns
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"max_score" : null,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "bccEn3oBRQ1dOOnBe3nD",
"_score" : null,
"fields" : {
"email" : [
"aaa#aaa.com"
]
},
"sort" : [
30
],
"inner_hits" : {
"last_document" : {
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "a8cEn3oBRQ1dOOnBdXli",
"_score" : null,
"_source" : {
"name" : "Aaa 1",
"rating" : 1,
"email" : "aaa#aaa.com"
},
"sort" : [
1609459200000
]
}
]
}
}
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "b8cEn3oBRQ1dOOnBiHkx",
"_score" : null,
"fields" : {
"email" : [
"bbb#bbb.com"
]
},
"sort" : [
5
],
"inner_hits" : {
"last_document" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "bscEn3oBRQ1dOOnBgHlt",
"_score" : null,
"_source" : {
"name" : "Bbb 4",
"rating" : 4,
"email" : "bbb#bbb.com"
},
"sort" : [
1609545600000
]
}
]
}
}
}
}
]
}
}
I have run into the same problem. As far as I know, this is not possible.
As a workaround you can do this:
GET test/_search
{
"_source": false,
"track_total_hits": false,
"query": {
"match_all": {}
},
"collapse": {
"field": "email"
},
"sort": [
{
"created_at": {
"order": "desc"
}
}
]
}
This would return the latest document per email in your 'normal' hits array. You would then need to sort those by rating after the search.
The problem I have is that my result set is too large to fetch at once and re-sort after the search. If you have found a different solution to this, I would be happy to hear it :)
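The client-side re-sort mentioned above could look roughly like the following Python sketch. It assumes the search is changed to return `rating` in `_source` (the workaround query sets `"_source": false`, so that would need adjusting), and the function name and sample response are hypothetical.

```python
def sort_collapsed_hits_by_rating(response):
    """Re-sort the top-level hits of a collapsed search by _source.rating, descending."""
    hits = response["hits"]["hits"]
    return sorted(hits, key=lambda hit: hit["_source"]["rating"], reverse=True)


# Hypothetical, trimmed response: one hit per email, already the latest by created_at.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"name": "Aaa 1", "rating": 1, "email": "aaa#aaa.com"}},
            {"_source": {"name": "Bbb 4", "rating": 4, "email": "bbb#bbb.com"}},
        ]
    }
}

ordered_names = [
    hit["_source"]["name"] for hit in sort_collapsed_hits_by_rating(sample_response)
]
```

This only works when all collapsed hits fit in memory, which is exactly the limitation described above.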
I'm pretty new to the Elasticsearch world and I might be missing some concept.
Here is the scenario I don't understand:
I want to find documents matching the following criteria:
category.level = A
category.name = "John G." OR "Chris T."
approved = yes (optional)
Mappings:
PUT data
{
"mappings": {
"properties": {
"createdAt": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss.SSSZ"
},
"category": {
"type": "nested",
"properties": {
"name": {
"type": "text",
"analyzer": "keyword"
}
}
},
"approved": {
"type": "text",
"analyzer": "keyword"
}
}
}
}
Data:
POST data/_create/1
{
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Mary F.",
"level": "A"
}
],
"createdBy": "John",
"createdAt": "2022-04-18 19:09:27.527+0200",
"approved": "yes"
}
POST data/_create/2
{
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Chris T.",
"level": "A"
}
],
"createdBy": "John",
"createdAt": "2022-04-18 19:09:27.527+0200",
"approved": "no"
}
POST data/_create/3
{
"category": [
{
"name": "John G.",
"level": "C"
},
{
"name": "Phil C.",
"level": "C"
}
],
"createdBy": "John",
"createdAt": "2022-04-18 19:09:27.527+0200",
"approved": "no"
}
POST data/_create/4
{
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Chris T.",
"level": "A"
}
],
"createdBy": "John",
"createdAt": "2020-04-18 19:09:27.527+0200",
"approved": "yes"
}
POST data/_create/5
{
"category": [
{
"name": "Unknown A.",
"level": "A"
},
{
"name": "Unknown B.",
"level": "A"
}
],
"createdBy": "Unknown",
"createdAt": "2020-08-18 19:09:27.527+0200",
"approved": "yes"
}
Query:
GET data/_search
{
"query": {
"nested": {
"path": "category",
"query": {
"bool": {
"must": [
{"match": {"category.level": "A"}}
],
"should": [
{"term": {"category.name": "John G."}},
{"term": {"category.name": "Chris T."}},
{"term": {"approved": "yes"}}
],
"minimum_should_match": 1
}
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.4455402,
"hits" : [
{
"_index" : "data",
"_id" : "2",
"_score" : 1.4455402,
"_source" : {
"category" : [
{
"name" : "John G.",
"level" : "A"
},
{
"name" : "Chris T.",
"level" : "A"
}
],
"createdBy" : "John",
"createdAt" : "2022-04-18 19:09:27.527+0200",
"approved" : "no"
}
},
{
"_index" : "data",
"_id" : "4",
"_score" : 1.4455402,
"_source" : {
"category" : [
{
"name" : "John G.",
"level" : "A"
},
{
"name" : "Chris T.",
"level" : "A"
}
],
"createdBy" : "John",
"createdAt" : "2020-04-18 19:09:27.527+0200",
"approved" : "yes"
}
},
{
"_index" : "data",
"_id" : "1",
"_score" : 1.151647,
"_source" : {
"category" : [
{
"name" : "John G.",
"level" : "A"
},
{
"name" : "Mary F.",
"level" : "A"
}
],
"createdBy" : "John",
"createdAt" : "2022-04-18 19:09:27.527+0200",
"approved" : "yes"
}
}
]
}
}
Questions:
Why is the first document returned one with approved = no? I was expecting docs with approved = yes to score higher.
Why is the doc with id = 5 not returned? (It doesn't meet the category.name criterion, but it does match approved = yes.)
The optionality of approved = yes is not expressed in the above query. How could I add a separate should clause with minimum_should_match: 0? Something that would increase the score but would not filter the results.
You need to use the query below, which has a main bool query. Its must clause contains the nested query, which in turn holds a bool query with a clause on the category.level field and another bool query with should clauses on the category.name field.
The main bool query also has a should clause on approved, which boosts results with the value yes (this is outside the nested query).
POST data/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "category",
"query": {
"bool": {
"must": [
{
"term": {
"category.level": {
"value": "a"
}
}
},
{
"bool": {
"should": [
{
"term": {
"category.name": "John G."
}
},
{
"term": {
"category.name": "Chris T."
}
}
]
}
}
]
}
}
}
}
],
"should": [
{
"term": {
"approved": "yes"
}
}
]
}
}
}
Result:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.9845366,
"hits" : [
{
"_index" : "data",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.9845366,
"_source" : {
"category" : [
{
"name" : "John G.",
"level" : "A"
},
{
"name" : "Chris T.",
"level" : "A"
}
],
"createdBy" : "John",
"createdAt" : "2020-04-18 19:09:27.527+0200",
"approved" : "yes"
}
},
{
"_index" : "data",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.6906434,
"_source" : {
"category" : [
{
"name" : "John G.",
"level" : "A"
},
{
"name" : "Mary F.",
"level" : "A"
}
],
"createdBy" : "John",
"createdAt" : "2022-04-18 19:09:27.527+0200",
"approved" : "yes"
}
},
{
"_index" : "data",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.4455402,
"_source" : {
"category" : [
{
"name" : "John G.",
"level" : "A"
},
{
"name" : "Chris T.",
"level" : "A"
}
],
"createdBy" : "John",
"createdAt" : "2022-04-18 19:09:27.527+0200",
"approved" : "no"
}
}
]
}
}
Why the first document returned is an approval = no? I was expecting
that docs with approval = yes would be better scored.
Because the should clause on approved sits inside the nested query. It does not match any document there, since approved lives outside category, so it does not change the score.
Why doc with index = 5 (it doesn't attend the criteria category.name,
but it does for approved = yes) is not being returned?
It is removed by your must clause. If you need the document with id = 5 as well, you can use two should clauses, one for the nested query and one for approved, and that will resolve your issue.
Your question 3 is also resolved by my answer.
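The structure described in this answer (nested query as a required clause, approved as an optional boost outside it) can be sketched as a small Python builder. The function name and parameters are hypothetical, and it uses `match` on category.level as in the question rather than `term`.

```python
def build_category_query(level, names, approved_value="yes"):
    """Build a bool query: required nested match on category, optional boost on approved."""
    nested_query = {
        "nested": {
            "path": "category",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"category.level": level}},
                        # A bool with only should clauses requires at least one to match.
                        {"bool": {"should": [{"term": {"category.name": n}} for n in names]}},
                    ]
                }
            },
        }
    }
    return {
        "query": {
            "bool": {
                "must": [nested_query],
                # Outside the nested query: affects the score only, never filters.
                "should": [{"term": {"approved": approved_value}}],
            }
        }
    }


query_body = build_category_query("A", ["John G.", "Chris T."])
```

Keeping the approved clause in the outer bool is the design point: the nested query only sees fields under category, so a clause on approved placed inside it can never match.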
I tried your scenario with your mapping and sample data and found the issue: you are using approved:yes inside the nested query context. If you change the query to the one below (basically using approved:yes in the should block but outside the nested query), it solves all your issues.
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "category",
"query": {
"bool": {
"must": [
{
"match": {
"category.level": "A"
}
}
],
"should": [
{
"term": {
"category.name": "John G."
}
},
{
"term": {
"category.name": "Chris T."
}
}
]
}
}
}
},
{
"term": {
"approved": "yes"
}
}
]
}
}
}
And search result
"hits": [
{
"_index": "71967271",
"_id": "4",
"_score": 1.9845366,
"_source": {
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Chris T.",
"level": "A"
}
],
"createdBy": "John",
"createdAt": "2020-04-18 19:09:27.527+0200",
"approved": "yes"
}
},
{
"_index": "71967271",
"_id": "2",
"_score": 1.4455402,
"_source": {
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Chris T.",
"level": "A"
}
],
"createdBy": "John",
"createdAt": "2022-04-18 19:09:27.527+0200",
"approved": "no"
}
},
{
"_index": "71967271",
"_id": "1",
"_score": 1.2437345,
"_source": {
"category": [
{
"name": "John G.",
"level": "A"
},
{
"name": "Mary F.",
"level": "A"
}
],
"createdBy": "John",
"createdAt": "2022-04-18 19:09:27.527+0200",
"approved": "yes"
}
},
{
"_index": "71967271",
"_id": "5",
"_score": 0.7968255,
"_source": {
"category": [
{
"name": "Unknown A.",
"level": "A"
},
{
"name": "Unknown B.",
"level": "A"
}
],
"createdBy": "Unknown",
"createdAt": "2020-08-18 19:09:27.527+0200",
"approved": "yes"
}
}
]
I have an index with documents like this
[
{
"customer_id" : "123",
"country": "USA",
"department": "IT",
"creation_date" : "2021-06-23"
...
},
{
"customer_id" : "123",
"country": "USA",
"department": "IT",
"creation_date" : "2021-06-24"
...
},
{
"customer_id" : "345",
"country": "USA",
"department": "IT",
"creation_date" : "2021-06-25"
...
}
]
I want to get the list of all documents from a specific country, e.g. USA, within a given time range, with at least 2 occurrences of the same customer_id.
With the above data, it should return
[
{
"customer_id" : "123",
"country": "USA",
"department": "IT",
"creation_date" : "2021-06-24"
...
}
]
Now, I tried the below ES query
POST /index_name/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"creation_date": {
"gte": "2021-06-23",
"lte": "2021-08-23"
}
}
},
{
"match": {
"country": "USA"
}
}
]
}
},
"aggs": {
"customer_agg": {
"terms": {
"field": "customer_id",
"min_doc_count": 2
}
}
}
}
The above query returns the following result
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : 1.5587491,
"hits" : [...]
},
"aggregations" : {
"customer_agg" : {
"doc_count_error_upper_bound" : 1,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : "customer_id",
"doc_count" : 2
}
]
}
}
I don't need the list of buckets in response, but only the list of documents satisfying the condition. How can I achieve it?
At first glance, I noticed that the search query ranges over a field called creation_timestamp, but in the documents the date field you want to range check is called creation_date.
I decided to test this locally on Elasticsearch 7.10 and here are the settings I used
PUT /test-index-v1
PUT /test-index-v1/_mapping
{
"properties": {
"customer_id": {
"type": "keyword"
},
"country": {
"type": "keyword"
},
"department": {
"type": "keyword"
},
"creation_date": {
"type": "date"
}
}
}
As you can see, I'm using keyword on the fields so that we can use them for sorting, aggregations, and so on.
After I created the index I imported the documents you gave as an example
POST /test-index-v1/_doc
{
"customer_id" : "345",
"country": "USA",
"department": "IT",
"creation_date" : "2021-06-25"
}
POST /test-index-v1/_doc
{
"customer_id" : "123",
"country": "USA",
"department": "IT",
"creation_date" : "2021-06-25"
}
POST /test-index-v1/_doc
{
"customer_id" : "123",
"country": "USA",
"department": "IT",
"creation_date" : "2021-06-24"
}
Then I executed this search query including a must match on the customer_id as well:
POST /test-index-v1/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"creation_date": {
"gte": "2021-06-23",
"lte": "2021-08-23"
}
}
},
{
"match": {
"country": "USA"
}
},
{
"match": {
"customer_id": "123"
}
}
]
}
},
"aggs": {
"customer_agg": {
"terms": {
"field": "customer_id",
"min_doc_count": 2
}
}
}
}
This query returns the search hits as well; the aggregation on its own only gives you bucket counts, not the documents.
Here is the response I received:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.6035349,
"hits" : [
{
"_index" : "test-index-v1",
"_type" : "_doc",
"_id" : "vbVD9HsBRVWFAvvZTW-l",
"_score" : 1.6035349,
"_source" : {
"customer_id" : "123",
"country" : "USA",
"department" : "IT",
"creation_date" : "2021-06-25"
}
},
{
"_index" : "test-index-v1",
"_type" : "_doc",
"_id" : "vrVD9HsBRVWFAvvZU29q",
"_score" : 1.6035349,
"_source" : {
"customer_id" : "123",
"country" : "USA",
"department" : "IT",
"creation_date" : "2021-06-24"
}
}
]
},
"aggregations" : {
"customer_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "123",
"doc_count" : 2
}
]
}
}
}
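The query above hard-codes customer_id 123. One way to generalize it, sketched in Python under the assumption that two requests are acceptable, is to read the bucket keys from the aggregation response and build a follow-up query that filters on them. The function names are hypothetical; the field names follow the mapping above.

```python
def customer_ids_from_buckets(search_response, agg_name="customer_agg"):
    """Collect the customer ids that passed min_doc_count in the terms aggregation."""
    buckets = search_response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


def build_documents_query(customer_ids, country, gte, lte):
    """Build a second search body that returns only documents of the given customers."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"creation_date": {"gte": gte, "lte": lte}}},
                    {"term": {"country": country}},
                    {"terms": {"customer_id": customer_ids}},
                ]
            }
        }
    }


# Trimmed copy of the aggregation part of the response shown above.
first_response = {
    "aggregations": {"customer_agg": {"buckets": [{"key": "123", "doc_count": 2}]}}
}
ids = customer_ids_from_buckets(first_response)
second_body = build_documents_query(ids, "USA", "2021-06-23", "2021-08-23")
```

Note that a terms aggregation returns only a limited number of buckets by default, so for many customers the `size` of the aggregation would have to be raised or the buckets paged.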
Hope this helps with your issue. Feel free to leave a comment if you have other questions regarding Elastic! :)
EDIT:
Regarding the grouping by customer_id in a certain date range I used this query:
POST /test-index-v1/_search
{
"aggs": {
"group_by_customer_id": {
"terms": {
"field": "customer_id"
},
"aggs": {
"dates_between": {
"filter": {
"range": {
"creation_date": {
"gte": "2020-06-23",
"lte": "2021-06-24"
}
}
}
}
}
}
}
}
And the response is:
"aggregations" : {
"group_by_customer_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "123",
"doc_count" : 2,
"dates_between" : {
"doc_count" : 1
}
},
{
"key" : "345",
"doc_count" : 1,
"dates_between" : {
"doc_count" : 0
}
}
]
}
}
I have an index with nested fields. I want the response to include only the nested objects that match a condition, along with the other fields. For example, consider the mappings
PUT /users
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"address": {
"type": "nested",
"properties": {
"state": {
"type": "keyword"
},
"city": {
"type": "keyword"
},
"country": {
"type": "keyword"
}
}
}
}
}
}
I want to search users by name, and the response should only include the nested objects with country = "United States". Consider the following documents in the users index
{
"users": [
{
"name": "John",
"address": [
{
"state": "Alabama",
"city": "Alabaster",
"Country": "United States"
},
{
"state": "New Delhi",
"city": "Agra",
"Country": "India"
}
]
},
{
"name": "Edward John",
"address": [
{
"state": "Illinois",
"city": "Chicago",
"Country": "United States"
},
{
"state": "Afula",
"city": "Afula",
"Country": "Israel"
}
]
},
{
"name": "Edward John",
"address": [
{
"state": "Afula",
"city": "Afula",
"Country": "Israel"
}
]
}
]
}
I am expecting the search result as follows
{
"users": [
{
"name": "John",
"address": [
{
"state": "Alabama",
"city": "Alabaster",
"Country": "United States"
}
]
},
{
"name": "Edward John",
"address": [
{
"state": "Illinois",
"city": "Chicago",
"Country": "United States"
}
]
},
{
"name": "Edward John",
"address": [
]
}
]
}
Kindly provide a suitable Elasticsearch query to fetch these documents
The correct query would be this one:
POST users/_search
{
"_source": [
"name"
],
"query": {
"bool": {
"should": [
{
"nested": {
"path": "address",
"query": {
"bool": {
"must": [
{
"match": {
"address.Country": "United States"
}
}
]
}
},
"inner_hits": {}
}
},
{
"bool": {
"must_not": [
{
"nested": {
"path": "address",
"query": {
"bool": {
"must": [
{
"match": {
"address.Country": "United States"
}
}
]
}
}
}
}
]
}
}
]
}
}
}
Which returns this:
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.489748,
"hits" : [
{
"_index" : "users",
"_type" : "_doc",
"_id" : "X8pINHgB2VNT6r1rJj04",
"_score" : 1.489748,
"_source" : {
"name" : "John"
},
"inner_hits" : {
"address" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.489748,
"hits" : [
{
"_index" : "users",
"_type" : "_doc",
"_id" : "X8pINHgB2VNT6r1rJj04",
"_nested" : {
"field" : "address",
"offset" : 0
},
"_score" : 1.489748,
"_source" : {
"city" : "Alabaster",
"Country" : "United States",
"state" : "Alabama"
}
}
]
}
}
}
},
{
"_index" : "users",
"_type" : "_doc",
"_id" : "XftINHgBAEsNDPLQQxL8",
"_score" : 1.489748,
"_source" : {
"name" : "Edward John"
},
"inner_hits" : {
"address" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.489748,
"hits" : [
{
"_index" : "users",
"_type" : "_doc",
"_id" : "XftINHgBAEsNDPLQQxL8",
"_nested" : {
"field" : "address",
"offset" : 0
},
"_score" : 1.489748,
"_source" : {
"city" : "Chicago",
"Country" : "United States",
"state" : "Illinois"
}
}
]
}
}
}
},
{
"_index" : "users",
"_type" : "_doc",
"_id" : "UoZINHgBNlJvCnAGVzE9",
"_score" : 0.0,
"_source" : {
"name" : "Edward John"
},
"inner_hits" : {
"address" : {
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
}
}
]
}
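To get from this response to the shape the question asks for, the inner hits can be folded back into each user on the client. The Python sketch below assumes the default inner_hits name (the nested path, `address`), as seen in the response above; the function name and the trimmed sample are hypothetical.

```python
def users_with_matching_addresses(response, inner_hits_name="address"):
    """Rebuild {"users": [...]} keeping only the nested addresses returned as inner hits."""
    users = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(inner_hits_name, {})
        addresses = [h["_source"] for h in inner.get("hits", {}).get("hits", [])]
        users.append({"name": hit["_source"]["name"], "address": addresses})
    return {"users": users}


# Trimmed, hypothetical copy of the response structure shown above.
sample = {
    "hits": {
        "hits": [
            {
                "_source": {"name": "John"},
                "inner_hits": {"address": {"hits": {"hits": [
                    {"_source": {"city": "Alabaster", "Country": "United States", "state": "Alabama"}}
                ]}}},
            },
            {
                "_source": {"name": "Edward John"},
                "inner_hits": {"address": {"hits": {"hits": []}}},
            },
        ]
    }
}
reshaped = users_with_matching_addresses(sample)
```

As far as I know, inner_hits returns only a few nested hits per document by default, so users with many matching addresses would need a larger `size` inside `inner_hits`.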
Try the query below
{
"query": {
"nested": {
"path": "address",
"query": {
"bool": {
"must": [
{
"match": {
"address.Country": "United States"
}
}
]
}
},
"inner_hits": {}
}
}
}
The search result will be
"hits": [
{
"_index": "66579117",
"_type": "_doc",
"_id": "1",
"_score": 0.6931471,
"_source": {
"name": "John",
"address": [
{
"sate": "Alabama",
"city": "Alabaster",
"Country": "United States"
},
{
"sate": "New Delhi",
"city": "Agra",
"Country": "India"
}
]
},
"inner_hits": {
"address": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.6931471,
"hits": [
{
"_index": "66579117",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "address",
"offset": 0
},
"_score": 0.6931471,
"_source": {
"sate": "Alabama",
"city": "Alabaster",
"Country": "United States"
}
}
]
}
}
}
},
{
"_index": "66579117",
"_type": "_doc",
"_id": "2",
"_score": 0.6931471,
"_source": {
"name": "Edward",
"address": [
{
"sate": "Illinois",
"city": "Chicago",
"Country": "United States"
},
{
"sate": "Afula",
"city": "Afula",
"Country": "Israel"
}
]
},
"inner_hits": {
"address": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.6931471,
"hits": [
{
"_index": "66579117",
"_type": "_doc",
"_id": "2",
"_nested": {
"field": "address",
"offset": 0
},
"_score": 0.6931471,
"_source": {
"sate": "Illinois",
"city": "Chicago",
"Country": "United States"
}
}
]
}
}
}
}
]
I have a database with products. Each product is composed of the fields: uuid, group_id, title, since, till.
since and till define the interval of availability.
The intervals [since, till] are pairwise disjoint within each group_id, so no two products in one group have intersecting intervals.
I need to fetch a list of products that meets the following conditions:
on the list should be at most 1 product from each group
each product matches the given title
each product is current (since <= NOW <= till) OR if current product does not exist within its group, it should be the nearest product from the future (min(since) such that since >= NOW)
ES mapping:
{
"products": {
"mappings": {
"dynamic": "false",
"properties": {
"group_id": {
"type": "long",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"since": {
"type": "date",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"till": {
"type": "date",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Is it possible to create such query in Elasticsearch?
Looking at your mapping, I've created sample documents, the query and its response as below:
Sample Documents:
POST product_index/_doc/1
{
"group_id": 1,
"title": "nike",
"since": "2020-01-01",
"till": "2020-03-31"
}
POST product_index/_doc/2
{
"group_id": 2,
"title": "nike",
"since": "2020-01-01",
"till": "2020-03-31"
}
POST product_index/_doc/3
{
"group_id": 3,
"title": "nike",
"since": "2020-03-15",
"till": "2020-03-31"
}
POST product_index/_doc/4
{
"group_id": 3,
"title": "nike",
"since": "2020-03-19",
"till": "2020-03-31"
}
As shown above, there are 4 documents in total: groups 1 and 2 have one document each, while group 3 has two documents, both with since >= now.
Query Request:
The summary of the query is below:
Bool
- Must
- Match title as nike
- Should
- clause 1 - since <= now <= till
- clause 2 - now <= since
Agg
- Terms on GroupId
- Top Hits (retrieve only 1st document as your clause is at most for each group, and sort them by asc order of since)
Below is the actual query:
POST product_index/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match": {
"title": "nike"
}
},
{
"bool": {
"should": [
{
"bool": {
"must": [
{
"range": {
"till": {
"gte": "now"
}
}
},
{
"range": {
"since": {
"lte": "now"
}
}
}
]
}
},
{
"bool": {
"must": [
{
"range": {
"since": {
"gte": "now"
}
}
}
]
}
}
]
}
}
]
}
},
"aggs": {
"my_groups": {
"terms": {
"field": "group_id.keyword",
"size": 10
},
"aggs": {
"my_docs": {
"top_hits": {
"size": 1,
"sort": [
{ "since": { "order": "asc" } }
]
}
}
}
}
}
}
Notice that I've made use of Terms Aggregation and Top Hits as its sub-aggregation.
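The selection rule the query is meant to implement can be checked against a plain-Python reference. This is only a sketch: the function name is hypothetical, the title check is a crude stand-in for the match query, and "now" is passed in explicitly (here an assumed date of 2020-03-10) so the result is reproducible.

```python
from datetime import date


def pick_product_per_group(products, title, today):
    """Return {group_id: product}: the current product, else the nearest future one."""
    chosen = {}
    for product in products:
        if title.lower() not in product["title"].lower().split():
            continue  # crude stand-in for the match query on title
        since = date.fromisoformat(product["since"])
        till = date.fromisoformat(product["till"])
        is_current = since <= today <= till
        is_future = since >= today
        if not (is_current or is_future):
            continue  # expired products match neither should clause
        best = chosen.get(product["group_id"])
        if best is None or since < date.fromisoformat(best["since"]):
            # Keep the lowest since per group, like top_hits with sort asc and size 1.
            chosen[product["group_id"]] = product
    return chosen


PRODUCTS = [
    {"group_id": 1, "title": "nike", "since": "2020-01-01", "till": "2020-03-31"},
    {"group_id": 2, "title": "nike", "since": "2020-01-01", "till": "2020-03-31"},
    {"group_id": 3, "title": "nike", "since": "2020-03-15", "till": "2020-03-31"},
    {"group_id": 3, "title": "nike", "since": "2020-03-19", "till": "2020-03-31"},
]

result = pick_product_per_group(PRODUCTS, "nike", date(2020, 3, 10))
```

Because the intervals within a group are disjoint, a current product always has a lower since than any future one, which is why sorting by since ascending and taking one hit is enough.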
Response:
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_groups" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "3",
"doc_count" : 2,
"my_docs" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "product_index",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"group_id" : 3,
"title" : "nike",
"since" : "2020-03-15",
"till" : "2020-03-31"
},
"sort" : [
1584230400000
]
}
]
}
}
},
{
"key" : "1",
"doc_count" : 1,
"my_docs" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "product_index",
"_type" : "_doc",
"_id" : "1",
"_score" : null,
"_source" : {
"group_id" : 1,
"title" : "nike",
"since" : "2020-01-01",
"till" : "2020-03-31"
},
"sort" : [
1577836800000
]
}
]
}
}
},
{
"key" : "2",
"doc_count" : 1,
"my_docs" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "product_index",
"_type" : "_doc",
"_id" : "2",
"_score" : null,
"_source" : {
"group_id" : 2,
"title" : "nike",
"since" : "2020-01-01",
"till" : "2020-03-31"
},
"sort" : [
1577836800000
]
}
]
}
}
}
]
}
}
}
Let me know if this helps!
I'm having difficulty understanding how to get highlighting to work.
My queries are returning the item, but I do not see the tags that would cause the highlight.
Here's the set up for the test index:
curl -XPUT 'http://localhost:9200/testfoo' -d '{
"mappings": {
"entry": {
"properties": {
"id": { "type": "integer" },
"owner": { "type": "string" },
"target": {
"properties": {
"id": { "type": "integer" },
"type": {
"type": "string",
"index": "not_analyzed"
}
}
},
"body": { "type": "string" },
"body_plain": { "type": "string"}
}
}
}
}'
Here's a couple of inserted documents:
curl -XPUT 'http://localhost:9200/testfoo/entry/1' -d'{
"id": 1,
"owner": "me",
"target": {
"type": "event",
"id": 100
},
"body": "<div>Message One has foobar in it</div>",
"body_plain": "Message One has foobar in it"
}'
curl -XPUT 'http://localhost:9200/testfoo/entry/2' -d'{
"id": 2,
"owner": "me",
"target": {
"type": "event",
"id": 200
},
"body": "<div>Message One has no bar in it</div>",
"body_plain": "Message One has no bar in it"
}'
A simple search returns the expected document:
curl -XPOST 'http://localhost:9200/testfoo/_search?pretty' -d '{
"query": {
"simple_query_string": {
"query": "foobar"
}
}
}'
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.09492774,
"hits" : [ {
"_index" : "testfoo",
"_type" : "entry",
"_id" : "1",
"_score" : 0.09492774,
"_source" : {
"id" : 1,
"owner" : "me",
"target" : {
"type" : "event",
"id" : 100
},
"body" : "<div>Message One has foobar in it</div>",
"body_plain" : "Message One has foobar in it"
}
} ]
}
}
However, when I add "highlighting" I get the same JSON but body_plain is not "highlighted" with the matching term:
curl -XPOST 'http://localhost:9200/testfoo/_search?pretty' -d '{
"query":{
"query": {
"simple_query_string":{
"query":"foobar"
}
}
},
"highlight": {
"pre_tags": [ "<div class=\"highlight\">" ],
"post_tags": [ "</div>" ],
"fields": {
"_all": {
"fragment_size": 10,
"number_of_fragments": 1
}
}
},
"sort": [
"_score"
],
"_source": [ "target", "id", "body_plain", "body" ],
"min_score": 0.9,
"size":10
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "testfoo",
"_type" : "entry",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"body" : "<div>Message One has foobar in it</div>",
"target" : {
"id" : 100,
"type" : "event"
},
"body_plain" : "Message One has foobar in it"
}
} ]
}
}
Where I was expecting body_plain to look like
Message One has <div class="highlight">foobar</div> in it
Wondering what I'm doing wrong. Thanks.
From the official documentation
In order to perform highlighting, the actual content of the field is
required. If the field in question is stored (has store set to true in
the mapping) it will be used, otherwise, the actual _source will be
loaded and the relevant field will be extracted from it.
The _all field cannot be extracted from _source, so it can only be
used for highlighting if it mapped to have store set to true.
You have two ways to solve this. Either you change your mapping to store the _all field, by adding the _all block with store set to true:
{
"mappings": {
"entry": {
"_all": {
"store": true
},
"properties": {
...
Or you change your query to highlight all fields (the "*" wildcard under fields) and set require_field_match to false:
curl -XPOST 'http://localhost:9200/testfoo/_search?pretty' -d '{
"query":{
"query": {
"simple_query_string":{
"query":"foobar"
}
}
},
"highlight": {
"pre_tags": [ "<div class=\"highlight\">" ],
"post_tags": [ "</div>" ],
"require_field_match": false,
"fields": {
"*": {
"fragment_size": 10,
"number_of_fragments": 1
}
}
},
"sort": [
"_score"
],
"_source": [ "target", "id", "body_plain", "body" ],
"min_score": 0.9,
"size":10
}'
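Note that highlighted fragments are not written into `_source`; each hit carries them in a separate `highlight` object keyed by field name. A minimal Python sketch for reading them, with a hypothetical function name and a hand-written sample response (the fragment text is illustrative, not actual output):

```python
def extract_highlights(response):
    """Map each hit's _id to its highlight fragments (empty dict when the hit has none)."""
    return {hit["_id"]: hit.get("highlight", {}) for hit in response["hits"]["hits"]}


# Hypothetical response: the fragment content is made up for illustration.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_source": {"body_plain": "Message One has foobar in it"},
                "highlight": {
                    "body_plain": ['has <div class="highlight">foobar</div> in']
                },
            },
            {"_id": "2", "_source": {"body_plain": "Message One has no bar in it"}},
        ]
    }
}
highlights = extract_highlights(sample_response)
```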