We have an index of unique products where each document represents a single product, with the following fields: product_id, group_id, group_score, and product_score.
Consider the following index:
{
"product_id": "100-001",
"group_id": "100",
"group_score": 100,
"product_score": 60,
},
{
"product_id": "100-002",
"group_id": "100",
"group_score": 100,
"product_score": 40,
},
{
"product_id": "100-003",
"group_id": "100",
"group_score": 100,
"product_score": 50,
},
{
"product_id": "200-001",
"group_id": "200",
"group_score": 73,
"product_score": 20,
},
{
"product_id": "200-002",
"group_id": "200",
"group_score": 73,
"product_score": 53,
}
Every group contains ~1-200 products.
We are trying to write a query that satisfies the following conditions:
1. Products should be sorted by their group_score (desc).
2. No more than one product per group_id.
3. Get the product with the highest product_score within the group.
For example, applying the query on the above should return:
{
"product_id": "100-001"
},
{
"product_id": "200-002"
}
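To make the three conditions concrete, here is a minimal client-side sketch of the intended semantics in Python. It is only an illustration of the expected result, not a replacement for the search query; the function name best_product_per_group is made up for this example, and the third product's ID is assumed to be "100-003" so that product IDs are unique.

```python
from itertools import groupby

# Sample documents from the question; the third product's ID is assumed
# to be "100-003" here so that product IDs stay unique.
products = [
    {"product_id": "100-001", "group_id": "100", "group_score": 100, "product_score": 60},
    {"product_id": "100-002", "group_id": "100", "group_score": 100, "product_score": 40},
    {"product_id": "100-003", "group_id": "100", "group_score": 100, "product_score": 50},
    {"product_id": "200-001", "group_id": "200", "group_score": 73, "product_score": 20},
    {"product_id": "200-002", "group_id": "200", "group_score": 73, "product_score": 53},
]


def best_product_per_group(docs):
    """Return one product ID per group, ordered by group_score (desc)."""
    # Sort by group first so groupby sees each group as one contiguous run.
    by_group = sorted(docs, key=lambda d: d["group_id"])
    # Condition 2 and 3: keep only the highest-scoring product of each group.
    winners = [
        max(members, key=lambda d: d["product_score"])
        for _, members in groupby(by_group, key=lambda d: d["group_id"])
    ]
    # Condition 1: order the surviving products by their group's score.
    winners.sort(key=lambda d: d["group_score"], reverse=True)
    return [d["product_id"] for d in winners]


result = best_product_per_group(products)
print(result)
```

On the sample data this yields 100-001 for group 100 and 200-002 for group 200, matching the expected output above.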
We ended up with the following query:
{
"size": 0,
"aggs": {
"group_by_group_id": {
"terms": {
"field": "group_id",
"order":{
"max_group_score":"desc"
}
},
"aggs": {
"top_scores_hits": {
"top_hits": {
"sort": [
{
"product_score": {
"order": "desc"
}
}
],
"size": 1
}
},
"max_group_score":{
"max":{
"field":"group_score"
}
}
}
}
}
}
The problem is that this query is very slow because of the aggregations, and search performance is important to us.
We would love to hear suggestions for a better, more efficient solution.
Changing the index structure is acceptable.
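One alternative that may be worth benchmarking is field collapsing, which avoids the terms and top_hits aggregations entirely. The sketch below builds such a request body as a Python dict. It assumes that group_id is mapped as a keyword (or numeric) field with doc values and that the cluster version supports the collapse parameter; whether it is actually faster on this data has not been measured here.

```python
import json

# Hypothetical request body (an assumption, not taken from the question).
# The sort puts each group's best product ahead of its siblings, and
# collapse keeps only the first hit it sees for every group_id.
collapse_body = {
    "size": 10,
    "sort": [
        {"group_score": {"order": "desc"}},    # condition 1: groups by score
        {"product_score": {"order": "desc"}},  # condition 3: best product first
    ],
    "collapse": {"field": "group_id"},         # condition 2: one hit per group
    "_source": ["product_id"],
}

print(json.dumps(collapse_body, indent=2))
```

Because group_score is the same for every product in a group, the secondary sort on product_score decides which product represents the group after collapsing.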
Assume that we have this index in OpenSearch:
{
"settings": {
"index.knn": True,
"number_of_replicas": 0,
"number_of_shards": 1,
},
"mappings": {
"properties": {
"title": {"type": "text"},
"tag": {"type": "text"},
"e1": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
"e2": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
"e3": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
}
},
}
And we want to perform a search over all the fields (approximate knn for the vector fields). What would be the correct way to do this in OpenSearch?
I have this query that works but I'm not sure if it is the correct way of doing this and if it uses approximate knn:
{
"size": 10,
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {
"knn": {
"e1": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"knn": {
"e2": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"knn": {
"e3": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"match": {"title": "title"}
},
"weight": 0.1,
}
},
{
"function_score": {
"query": {"match": {"tag": "tag"}},
"weight": 0.1,
}
},
]
}
},
"_source": False,
}
In other words, I want to know how this, which is for Elasticsearch, can be done in OpenSearch.
Edit 1:
I want to use this new Elasticsearch feature in OpenSearch. The question is how, and also what the query mentioned above does exactly.
First of all, searching multiple kNN fields in Elasticsearch is not yet supported.
The feature is still in development and not yet released: see issue #91187 and PR #92118, which was merged for version 8.7 (the current version is 8.6).
Looking at the OpenSearch documentation for k-NN, it does not appear to be supported there either.
However, regarding the query you provided:
The knn search was not defined correctly. The right way is, for example:
{
"query": {
"knn": {
"my_vector": {
"vector": [2, 3, 5, 6],
"k": 2
}
}
}
}
where my_vector is the name of your vector field and vector is the query vector (i.e. the query text encoded as a vector), which must have the same number of dimensions as the vector field you are searching against.
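The dimension requirement is easy to check on the client before sending the request. The following is a small sketch in Python; build_knn_query and its expected_dim parameter are hypothetical helpers written for this example, not part of any OpenSearch client library.

```python
def build_knn_query(field, vector, k, expected_dim):
    """Build a k-NN query body, rejecting vectors of the wrong length."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"{field}: query vector has {len(vector)} dimensions, "
            f"expected {expected_dim}"
        )
    return {"query": {"knn": {field: {"vector": list(vector), "k": k}}}}


# A 512-dimensional vector matches the "dimension": 512 mapping of e1.
body = build_knn_query("e1", [0.0] * 512, k=10, expected_dim=512)

# The 4-element vector from the question does not match and is rejected.
try:
    build_knn_query("e1", [0, 1, 2, 3], k=10, expected_dim=512)
except ValueError as err:
    mismatch = str(err)
    print(mismatch)
```

With the mapping from the question, every query vector for e1, e2 and e3 would need 512 elements, so the 4-element vectors in the posted query would not pass this check.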
The match query value was not defined correctly; see the documentation.
The use of function_score is unclear and not quite correct.
Finally, if you are interested in vector search with OpenSearch, we recently wrote a blog post in which we provide a detailed description of the new neural search plugin introduced with version 2.4.0 through an end-to-end testing experience.
Let's say I have documents like these:
{
"_id": 1,
"threat": {
"application_number": 1234,
},
"score_algorithms": [
{
"score": 21,
},
{
"score": 93,
}
],
"max_similarity": 93,
}
{
"_id": 2,
"threat": {
"application_number": 1348,
},
"score_algorithms": [
{
"score": 45,
},
{
"score": 67,
}
],
"max_similarity": 67,
}
{
"_id": 3,
"threat": {
"application_number": 1234,
},
"score_algorithms": [
{
"score": 98,
},
{
"score": 51,
}
],
"max_similarity": 98,
}
Now the goal is to:
1. Sort these documents according to the maximum similarity attribute max_similarity.
2. Then aggregate the documents according to threat.application_number. For example, the first result should be a grouping of all documents where threat.application_number is 1234 (which has the highest value of max_similarity). The second entry would be a grouping of all documents where threat.application_number is 1348, and so on.
3. All documents should internally have sorted score_algorithms values.
For requirements 1 and 2, i.e. getting the documents grouped and sorted, you can use the order parameter in the terms aggregation definition.
To retrieve the score_algorithms field in the aggregation, use a top_hits sub-aggregation.
You will only be able to retrieve documents up to the size parameter of the top_hits aggregation. If there are many documents for a single application_number, this is likely to be slow.
{
"size": 0,
"aggs" : {
"applications" : {
"terms" : {
"field" : "threat.application_number",
"order": [{"stats.max": "desc"}]
},
"aggs" : {
"stats" : { "stats" : { "field" : "max_similarity" } },
"applications_fields": {
"top_hits": {
"sort": [
{
"max_similarity": {
"order": "desc"
}
}
],
"_source": {
"includes": [ "score_algorithms", "max_similarity" ]
},
"size" : 100
}
}
}
}
}
}
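The aggregation above does not reorder the score_algorithms array inside each hit, so requirement 3 would still need handling elsewhere. The Python sketch below shows the intended end result on the sample documents, including a client-side sort of score_algorithms. The descending order for the scores and the function name group_by_application are assumptions made for this example.

```python
from collections import defaultdict

docs = [
    {"_id": 1, "threat": {"application_number": 1234},
     "score_algorithms": [{"score": 21}, {"score": 93}], "max_similarity": 93},
    {"_id": 2, "threat": {"application_number": 1348},
     "score_algorithms": [{"score": 45}, {"score": 67}], "max_similarity": 67},
    {"_id": 3, "threat": {"application_number": 1234},
     "score_algorithms": [{"score": 98}, {"score": 51}], "max_similarity": 98},
]


def group_by_application(documents):
    """Group by application_number, best group first, best hit first."""
    groups = defaultdict(list)
    for doc in documents:
        # Requirement 3: sort each document's scores (descending is assumed).
        ordered = sorted(doc["score_algorithms"],
                         key=lambda s: s["score"], reverse=True)
        key = doc["threat"]["application_number"]
        groups[key].append({**doc, "score_algorithms": ordered})
    # Inside a group, order hits like the top_hits sort on max_similarity.
    for hits in groups.values():
        hits.sort(key=lambda d: d["max_similarity"], reverse=True)
    # Order groups by their highest max_similarity, like "stats.max": "desc".
    return sorted(groups.items(),
                  key=lambda kv: kv[1][0]["max_similarity"], reverse=True)


grouped = group_by_application(docs)
for application_number, hits in grouped:
    print(application_number, [h["_id"] for h in hits])
```

The group for application 1234 comes first because its best document has max_similarity 98, followed by application 1348 with 67.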
Let's say I have an elastic index with the following data:
{"var1": 14, "time": "2019-02-12T13:01:00.000Z"}
{"var2": 1423, "time": "2019-02-12T13:01:02.000Z"}
{"var3": 114, "time": "2019-02-12T13:01:03.000Z"}
{"var2": 214, "time": "2019-02-12T13:01:04.000Z"}
{"var3": 414, "time": "2019-02-12T13:01:05.000Z"}
{"var1": 124, "time": "2019-02-12T13:01:06.000Z"}
{"var2": 914, "time": "2019-02-12T13:01:07.000Z"}
{"var3": 8614, "time": "2019-02-12T13:01:06.000Z"}
{"var2": 74, "time": "2019-02-12T13:01:07.000Z"}
{"var3": 174, "time": "2019-02-12T13:01:08.000Z"}
{"var4": 144, "time": "2019-02-12T13:01:09.000Z"}
{"var4": 714, "time": "2019-02-12T13:01:10.000Z"}
{"var4": 813, "time": "2019-02-12T13:01:11.000Z"}
{"var2": 65, "time": "2019-02-12T13:01:12.000Z"}
{"var1": 321, "time": "2019-02-12T13:01:13.000Z"}
I would like to write ONE query that can retrieve the minimum of a variable, the maximum of a variable and the last n values of a variable in a given time interval.
It is important that I get the actual document that has the min, the max or the last value (this is why I'm using top_hits for the min and max instead of the min or max aggregations).
So far I have this query:
{
"query": {
"bool": {
"must": [
{
"range": {
"time": {
"gte": "2019-02-12T13:01:00.000Z",
"lt": "2019-02-12T13:01:15.000Z"
}
}
}
]
}
},
"size": 0,
"aggs": {
"max_var1": {
"top_hits": {
"size": 1,
"sort": [{
"var1": {"order": "desc"}
}]
}
},
"min_var2": {
"top_hits": {
"size": 1,
"sort": [{
"var2": {"order": "asc"}
}]
}
},
"last_var4": {
"top_hits": {
"size": 3,
"sort": [{
"time": {"order": "desc"}
}],
"_source": ["var4"]
}
}
}
}
The query correctly returns the min and the max values, but it doesn't return the correct last 3 values for var4, because it takes the last 3 of all the documents in the given time interval rather than of the documents that contain var4.
So the question is how to get the last n documents for a given variable inside this query.
I know I could use the multi search API to execute several queries at once, but I would like to know whether it is possible in one query.
Thanks.
Filtered aggregation to the rescue. Simply make sure to constrain the last_var4 aggregation to only those docs that actually have the field var4.
{
...
"last_var4": {
"filter": {
"bool": {
"filter": {
"exists": {
"field": "var4"
}
}
}
},
"aggs": {
"last_var4": {
"top_hits": {
"size": 3,
"sort": [
{
"time": {
"order": "desc"
}
}
],
"_source": [
"var4"
]
}
}
}
}
}
}
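The effect of the exists filter can be reproduced outside Elasticsearch. The Python sketch below runs over the sample documents from the question and compares the unfiltered "last 3 by time" with the filtered one; last_n is a hypothetical helper written for this illustration.

```python
docs = [
    {"var1": 14, "time": "2019-02-12T13:01:00.000Z"},
    {"var2": 1423, "time": "2019-02-12T13:01:02.000Z"},
    {"var3": 114, "time": "2019-02-12T13:01:03.000Z"},
    {"var2": 214, "time": "2019-02-12T13:01:04.000Z"},
    {"var3": 414, "time": "2019-02-12T13:01:05.000Z"},
    {"var1": 124, "time": "2019-02-12T13:01:06.000Z"},
    {"var2": 914, "time": "2019-02-12T13:01:07.000Z"},
    {"var3": 8614, "time": "2019-02-12T13:01:06.000Z"},
    {"var2": 74, "time": "2019-02-12T13:01:07.000Z"},
    {"var3": 174, "time": "2019-02-12T13:01:08.000Z"},
    {"var4": 144, "time": "2019-02-12T13:01:09.000Z"},
    {"var4": 714, "time": "2019-02-12T13:01:10.000Z"},
    {"var4": 813, "time": "2019-02-12T13:01:11.000Z"},
    {"var2": 65, "time": "2019-02-12T13:01:12.000Z"},
    {"var1": 321, "time": "2019-02-12T13:01:13.000Z"},
]


def last_n(documents, field, n):
    """Latest n documents that contain `field`, newest first."""
    # This list comprehension plays the role of the "exists" filter.
    with_field = [d for d in documents if field in d]
    # Timestamps share one ISO-8601 format and zone, so string order works.
    return sorted(with_field, key=lambda d: d["time"], reverse=True)[:n]


# Without the filter, only one of the three newest documents has var4.
unfiltered = sorted(docs, key=lambda d: d["time"], reverse=True)[:3]
filtered = last_n(docs, "var4", 3)

print([d.get("var4") for d in unfiltered])
print([d["var4"] for d in filtered])
```

Only the filtered variant returns three documents that all carry var4, which is the behaviour the filter aggregation adds to the original query.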
Hopefully I will be able to explain this issue clearly enough :/
I am trying to run a query on a resultset that returns a list of users who have liked an artist AND have a score greater than or equal to 500. Consider this index:
{
"profile": 12345,
"artists": [
{
"id": 135,
"score": 10
},
{
"id": 246,
"score": 50
},
{
"id": 1357,
"score": 100
}
]
},
{
"profile": 24680,
"artists": [
{
"id": 135,
"score": 1
},
{
"id": 246,
"score": 500
},
{
"id": 1357,
"score": 77
}
]
},
{
"profile": 13579,
"artists": [
{
"id": 135,
"score": 5
},
{
"id": 246,
"score": 1000
},
{
"id": 1357,
"score": 150
}
]
}
Now, I want to find users who have an artists.id value of 1357 AND a score greater than or equal to 100 for that artist. So, I would expect users 12345 and 13579 to be returned. However, if I run the following query:
{
"query": {
"bool": {
"must": [
{
"term": {
"artists.id": 1357
}
}
],
"filter": {
"range": {
"artists.score": {
"gte": 100
}
}
}
}
}
}
Then all three users are returned. Because user 24680 has a score of greater than 100 on one of his results, despite it not being the id that I am passing, he is still being treated as a match.
Does anyone know of a way of matching both conditions, or at least when filtering, filtering on those where the original condition matched...
...if that makes any sense
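The behaviour described above can be reproduced with a short Python sketch. It assumes that artists uses the default object mapping (the mapping is not shown in the question), in which case an array of objects is indexed as independent lists of values per field and the link between an id and its score is lost. The second function shows the per-object matching that the question is asking for, which is what a nested mapping combined with a nested query is designed to provide. Both function names are made up for this example.

```python
profiles = [
    {"profile": 12345, "artists": [
        {"id": 135, "score": 10}, {"id": 246, "score": 50}, {"id": 1357, "score": 100}]},
    {"profile": 24680, "artists": [
        {"id": 135, "score": 1}, {"id": 246, "score": 500}, {"id": 1357, "score": 77}]},
    {"profile": 13579, "artists": [
        {"id": 135, "score": 5}, {"id": 246, "score": 1000}, {"id": 1357, "score": 150}]},
]


def match_flattened(doc, artist_id, min_score):
    """Mimic the assumed object mapping: ids and scores are separate lists."""
    ids = [a["id"] for a in doc["artists"]]
    scores = [a["score"] for a in doc["artists"]]
    # Each condition is checked against the whole document, not one artist.
    return artist_id in ids and any(s >= min_score for s in scores)


def match_per_object(doc, artist_id, min_score):
    """Both conditions must hold for the same artist entry."""
    return any(a["id"] == artist_id and a["score"] >= min_score
               for a in doc["artists"])


flattened = [p["profile"] for p in profiles if match_flattened(p, 1357, 100)]
per_object = [p["profile"] for p in profiles if match_per_object(p, 1357, 100)]

print(flattened)
print(per_object)
```

The flattened check matches all three users, because user 24680 has artist 1357 and, separately, a score of 500 on another artist. The per-object check returns only 12345 and 13579.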
I'm trying to make a simple query in elasticsearch but I can't figure out how to do it. I searched all over the internet and there was no discussion on this situation.
Let's say I have items like these:
{
"item_id": 1,
"item_price": 100,
"item_quantity": 2
},
{
"item_id": 2,
"item_price": 200,
"item_quantity": 3
},
{
"item_id": 3,
"item_price": 150,
"item_quantity": 1
},
{
"item_id": 4,
"item_price": 250,
"item_quantity": 5
}
I want to write a query that gives me the total price of the stock,
for example: 100*2 + 200*3 + 150*1 + 250*5
The result of this query should be 2,200.
The query from the answer works for the data above, but what about this more complex situation:
POST tests/test2/
{
"item_category": "aaa",
"items":
[
{
"item_id": 1,
"item_price": 100,
"item_quantity": 2
},
{
"item_id": 2,
"item_price": 150,
"item_quantity": 4
}
]
}
POST tests/test2/
{
"item_category": "bbb",
"items":
[
{
"item_id": 3,
"item_price": 200,
"item_quantity": 3
},
{
"item_id": 4,
"item_price": 200,
"item_quantity": 5
}
]
}
POST tests/test2/
{
"item_category": "ccc",
"items":
[
{
"item_id": 5,
"item_price": 300,
"item_quantity": 2
},
{
"item_id": 6,
"item_price": 150,
"item_quantity": 8
}
]
}
POST tests/test2/
{
"item_category": "ddd",
"items":
[
{
"item_id": 7,
"item_price": 80,
"item_quantity": 10
},
{
"item_id": 8,
"item_price": 250,
"item_quantity": 4
}
]
}
In this case the following query does not work and gives me a wrong answer (1,420 instead of 6,000):
GET tests/test2/_search
{
"query": {
"match_all": { }
},
"aggs": {
"total_price": {
"sum": {
"script": {
"lang": "painless",
"inline": "doc['items.item_price'].value * doc['items.item_quantity'].value"
}
}
}
}
}
You can use a sum aggregation over values calculated by a script:
{
"aggs": {
"total_price": {
"sum": {
"script": {
"lang": "painless",
"inline": "doc['item_price'].value * doc['item_quantity'].value"
}
}
}
}
}
See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-sum-aggregation.html#_script_9 for more details.
Update
For your advanced case, it would be better to map your items field as a nested type. After that you can use this aggregation:
{
"aggs": {
"nested": {
"nested": {
"path": "items"
},
"aggs": {
"total_price": {
"sum": {
"script": {
"inline": "doc['items.item_price'].value * doc['items.item_quantity'].value"
}
}
}
}
}
}
}
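The two totals from the question (1,420 and 6,000) can be reproduced with a small Python sketch. It rests on one assumption about the non-nested case: the items array is flattened into one sorted list of values per field, and .value in the script reads the smallest value of each list. Under that assumption the script multiplies the lowest price by the lowest quantity of every document, regardless of which item they belong to. The function names are made up for this example.

```python
documents = [
    {"item_category": "aaa", "items": [
        {"item_price": 100, "item_quantity": 2}, {"item_price": 150, "item_quantity": 4}]},
    {"item_category": "bbb", "items": [
        {"item_price": 200, "item_quantity": 3}, {"item_price": 200, "item_quantity": 5}]},
    {"item_category": "ccc", "items": [
        {"item_price": 300, "item_quantity": 2}, {"item_price": 150, "item_quantity": 8}]},
    {"item_category": "ddd", "items": [
        {"item_price": 80, "item_quantity": 10}, {"item_price": 250, "item_quantity": 4}]},
]


def flattened_total(docs):
    """Assumed non-nested behaviour: one product per document, using the
    smallest price and the smallest quantity found anywhere in `items`."""
    return sum(
        min(i["item_price"] for i in d["items"])
        * min(i["item_quantity"] for i in d["items"])
        for d in docs
    )


def nested_total(docs):
    """Nested behaviour: every item contributes its own price * quantity."""
    return sum(i["item_price"] * i["item_quantity"]
               for d in docs for i in d["items"])


print(flattened_total(documents))
print(nested_total(documents))
```

The first function gives 200 + 600 + 300 + 320 = 1,420 and the second gives 800 + 1,600 + 1,800 + 1,800 = 6,000, which matches the wrong and the expected totals reported in the question.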
This is the mapping request for the example data in the question:
PUT tests
{
"mappings": {
"test2": {
"properties": {
"items": {
"type": "nested"
}
}
}
}
}
Just to clarify: you must apply this mapping when the index is created, before any documents are indexed (changing the mapping of an existing field is not allowed).