k-NN multiple field search in OpenSearch - elasticsearch

Assume that we have this index in OpenSearch:
{
"settings": {
"index.knn": True,
"number_of_replicas": 0,
"number_of_shards": 1,
},
"mappings": {
"properties": {
"title": {"type": "text"},
"tag": {"type": "text"},
"e1": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
"e2": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
"e3": {
"type": "knn_vector",
"dimension": 512,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {"ef_construction": 512, "m": 24},
},
},
}
},
}
And we want to perform a search over all the fields (approximate knn for the vector fields). What would be the correct way to do this in OpenSearch?
I have this query that works but I'm not sure if it is the correct way of doing this and if it uses approximate knn:
{
"size": 10,
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {
"knn": {
"e1": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"knn": {
"e2": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"knn": {
"e3": {
"vector": [0, 1, 2, 3],
"k": 10,
},
}
},
"weight": 1,
}
},
{
"function_score": {
"query": {
"match": {"title": "title"}
},
"weight": 0.1,
}
},
{
"function_score": {
"query": {"match": {"tag": "tag"}},
"weight": 0.1,
}
},
]
}
},
"_source": False,
}
In other words, I want to know how this which is for ElasticSearch can be done in OpenSearch.
Edit 1:
I want to do this Elasticsearch new feature in OpenSearch. The question is how and also what does the query mentioned above does exactly.

First of all, searching multiple kNN fields in Elasticsearch is not yet supported.
Here you can find the development, not yet released, related to issue #91187 and PR #92118 that was merged for version 8.7... the current version is 8.6.
Looking at the OpenSearch documentation for k-NN, it does not appear to be supported there either.
However, regarding the query you provided:
knn search was not defined well... the right way is, for example:
{
"query": {
"knn": {
"my_vector": {
"vector": [2, 3, 5, 6],
"k": 2
}
}
}
}
where my_vector is the name of your vector field while vector is the query vector (i.e. query text encoded into the corresponding vectors) that must have the same number of dimensions as the vector field you are searching against.
the match query value was not defined well. Here the documentation.
the use of the function_score is unclear and not properly correct.
Finally, if you are interested in vector search with OpenSearch, we recently wrote a blog post in which we provide a detailed description of the new neural search plugin introduced with version 2.4.0 through an end-to-end testing experience.

Related

Filter documents out of the facet count in enterprise search

We use enterprise search indexes to store items that can be tagged by multiple tenants.
e.g
[
{
"id": 1,
"name": "document 1",
"tags": [
{ "company_id": 1, "tag_id": 1, "tag_name": "bla" },
{ "company_id": 2, "tag_id": 1, "tag_name": "bla" }
]
}
]
I'm looking to find a way to retrieve all documents with only the tags of company 1
This request:
{
"query": "",
"facets": {
"tags": {
"type": "value"
}
},
"sort": {
"created": "desc"
},
"page": {
"size": 20,
"current": 1
}
}
Is coming back with
...
"facets": {
"tags": [
{
"type": "value",
"data": [
{
"value": "{\"company_id\":1,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
},
{
"value": "{\"company_id\":2,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
}
]
}
],
}
...
Can I modify the request in a way such that I get no tags by "company_id" = 2 ?
I have a solution that involves modifying the results to strip the extra data after they are retrieved but I'm looking for a better solution.

How to filter by score in Elastic KNN search?

I have index with following mapping:
{
"test-2": {
"mappings": {
"properties": {
"advert_id": {
"type": "integer"
},
"fraud": {
"type": "boolean"
},
"photos": {
"properties": {
"id": {
"type": "integer"
},
"vector": {
"type": "dense_vector",
"dims": 3,
"index": true,
"similarity": "l2_norm"
}
}
},
"rating": {
"type": "long"
}
}
}
}
}
Here is how my data is saved in Elastic:
{
"advert_id": 123,
"fraud": true,
"photos": [
{
"id": 456,
"vector": [
213.32,
3.23,
4.21
]
}
]
}
I want to search data with similar vectors according to KNN algorithm. Here is my query for that:
GET /test-2/_knn_search
{
"knn": {
"field": "photos.vector",
"k": 1,
"num_candidates": 5,
"query_vector": [213.32, 3.23, 4.22]
}
}
Elastic returns me a score per each hit. Question is how can I get data with score more than N? It know about min_score, but couldn't apply it in this query.
Now that the kNN search API (/_knn_search) has been integrated into the search API (/_search) since Elasticsearch 8.4.0, we can use the min_score option as per the documentation as follows:
- GET /test-2/_knn_search
+ GET /test-2/_search
{
"knn": {
"field": "photos.vector",
"k": 1,
"num_candidates": 5,
"query_vector": [213.32, 3.23, 4.22]
},
+ "min_score": N
}

How to filter query based on a field value

I'm working with elasticsearch Query dsl, and I can't find a way to express the following:
Return results that have the field "price" > min budget and have "price" < max Budget and have has_price=true and also return all results that have "has_price=false"
In other words, I would like to use a range filter on results only that have has_price field set to true, otherwise, on results that have has_price set to false don't take in consideration the filter
Here's the mapping:
{
"formations": {
"mappings": {
"properties": {
"code": {
"type": "text"
},
"date": {
"type": "date",
"format": "dd/MM/yyyy"
},
"description": {
"type": "text"
},
"has_price": {
"type": "boolean"
},
"place": {
"type": "text"
},
"price": {
"type": "float"
},
"title": {
"type": "text"
}
}
}
}
}
The following query combines the 2 scenarios as 2 should clauses in a bool-query. And as there are only should clauses, minimum_should_match will be 1, meaning that at least one should-clause has to match:
Abstract Code Snippet
GET formations/_search
{
"query": {
"bool": {
"should": [
{ <1st scenario: has_price = false> },
{ <2nd scenario> has_price = true AND price IN budget_range}
]
}
}
}
Actual Sample Code Snippets
# 1. Create the index and populate it with some sample documents
POST formations/_bulk
{"index": {"_id": 1}}
{"has_price": true, "price": 2.0}
{"index": {"_id": 2}}
{"has_price": true, "price": 3.0}
{"index": {"_id": 3}}
{"has_price": true, "price": 4.0}
{"index": {"_id": 4}}
{"has_price": false, "price": 2.0}
{"index": {"_id": 5}}
{"has_price": false, "price": 3.0}
{"index": {"_id": 6}}
{"has_price": false, "price": 4.0}
# 2. Query assuming min_budget = 2.0 and max_budget = 4.0
GET formations/_search
{
"query": {
"bool": {
"should": [
{
"bool": {
"filter": {
"term": {
"has_price": false
}
}
}
},
{
"bool": {
"filter": [
{
"term": {
"has_price": true
}
},
{
"range": {
"price": {
"gt": 2,
"lt": 4
}
}
}
]
}
}
]
}
}
}
# 3. Result Snippet (4 hits: 3 from 1st scenario & 1 from 2nd scenario)
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
...
Don't forget to add the Claus "minimum_should_match": 1 to your bool-query in case you add another non-should-clause to your bool-query.
Let me know if this answers your question & solves your issue.

Sorting result based on dynamic terms

Imagine that I have an index with the following three documents representing images and their colors.
[
{
"id": 1,
"intensity": {
"red": 0.6,
"green": 0.1,
"blue": 0.3
}
},
{
"id": 2,
"intensity": {
"red": 0.5,
"green": 0.6,
"blue": 0.0
}
},
{
"id": 3,
"intensity": {
"red": 0.98,
"green": 0.0,
"blue": 0.0
}
}
]
It the user wants a "red image" (selected in a dropdown or in a “tag cloud”), it is very convenient to do a range query over the floats (maybe intensity.red > 0.5). I can also use the score of that query to get the "red-est" image ranked highest.
However, if I would like to offer free-text search, it gets harder. My solution to that would to index the documents as the following (eg use the if color > 0.5 then append(colors, color_name) at index time):
[
{
"id": 1,
"colors": ["red"]
},
{
"id": 2,
"colors": ["green", "red"]
}
{
"id": 3,
"colors": ["red"]
}
]
I could now use a query_string or a match on the colors field and then search for "red", but all of a sudden I lost my ranking possibilities. ID 3 is far more red than ID 1 (0.98 vs 0.6) but the score will be similar?
My question is: Can I have the cake and eat it too?
One solution I see is to have one index that turns free text into "keywords" which I later use in the actual search.
POST image_tag_index/_search {query: "redish"} -> [ "red" ]
POST images/_search {query: {"red" > 0.5}} -> [ {id: 1}, {id: 3}]
But then I need to fire two searches for every search, but maybe that is the only option?
You can make use of nested data type along with function_score query to get the desired result.
You need to change the way you are storing image data. The mapping will be as below:
PUT test
{
"mappings": {
"_doc": {
"properties": {
"id": {
"type": "integer"
},
"image": {
"type": "nested",
"properties": {
"color": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"intensity": {
"type": "float"
}
}
}
}
}
}
}
Index the image data as below:
PUT test/_doc/1
{
"id": 1,
"image": [
{
"color": "red",
"intensity": 0.6
},
{
"color": "green",
"intensity": 0.1
},
{
"color": "blue",
"intensity": 0.3
}
]
}
The above corresponds to the first image data the you posted in the question. Similarly you can index other images data.
Now when a user search for red the query should be build as below:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "image",
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"match": {
"image.color": "red"
}
},
{
"range": {
"image.intensity": {
"gt": 0.5
}
}
}
]
}
},
"field_value_factor": {
"field": "image.intensity",
"modifier": "none",
"missing": 0
}
}
}
}
}
]
}
}
}
You can see in the above query that I have used the field value of image.intensity to calculate the score.

Extract record from multiple arrays based on a filter

I have documents in ElasticSearch with the following structure :
"_source": {
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"price": [
"€ 139",
"€ 125",
"€ 120",
"€ 108"
],
"max_occupancy": [
2,
2,
1,
1
],
"type": [
"Type 1",
"Type 1 - (Tag)",
"Type 2",
"Type 2 (Tag)",
],
"availability": [
10,
10,
10,
10
],
"size": [
"26 m²",
"35 m²",
"47 m²",
"31 m²"
]
}
}
Basically, the details records are split in 5 arrays, and fields of the same record have the same index position in the 5 arrays. As can be seen in the example data there are 5 array(price, max_occupancy, type, availability, size) that are containing values related to the same element. I want to extract the element that has max_occupancy field greater or equal than 2 (if there is no record with 2 grab a 3 if there is no 3 grab a four, ...), with the lower price, in this case the record and place the result into a new JSON object like the following :
{
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"price: ": "€ 125",
"max_occupancy": "2",
"type": "Type 1 - (Tag)",
"availability": 10,
"size": "35 m²"
}
Basically the result structure should show the extracted record(that in this case is the second index of all array), and add the general information to it(fields : "last_updated", "country").
Is it possible to extract such a result from elastic search? What kind of query do I need to perform?
Could someone suggest the best approach?
My best approach: go nested with Nested Datatype
Except for easier querying, it easier to read and understand the connections between those objects that are, currently, scattered in different arrays.
Yes, if you'll decide this approach you will have to edit your mapping and re-index your entire data.
How would the mapping is going to look like? something like this:
{
"mappings": {
"properties": {
"last_updated": {
"type": "date"
},
"country": {
"type": "string"
},
"records": {
"type": "nested",
"properties": {
"price": {
"type": "string"
},
"max_occupancy": {
"type": "long"
},
"type": {
"type": "string"
},
"availability": {
"type": "long"
},
"size": {
"type": "string"
}
}
}
}
}
}
EDIT: New document structure (containing nested documents) -
{
"last_updated": "2017-10-25T18:33:51.434706",
"country": "Italia",
"records": [
{
"price": "€ 139",
"max_occupancy": 2,
"type": "Type 1",
"availability": 10,
"size": "26 m²"
},
{
"price": "€ 125",
"max_occupancy": 2,
"type": "Type 1 - (Tag)",
"availability": 10,
"size": "35 m²"
},
{
"price": "€ 120",
"max_occupancy": 1,
"type": "Type 2",
"availability": 10,
"size": "47 m²"
},
{
"price": "€ 108",
"max_occupancy": 1,
"type": "Type 2 (Tag)",
"availability": 10,
"size": "31 m²"
}
]
}
Now, its more easy to query for any specific condition with Nested Query and Inner Hits. for example:
{
"_source": [
"last_updated",
"country"
],
"query": {
"bool": {
"must": [
{
"term": {
"country": "Italia"
}
},
{
"nested": {
"path": "records",
"query": {
"bool": {
"must": [
{
"range": {
"records.max_occupancy": {
"gte": 2
}
}
}
]
}
},
"inner_hits": {
"sort": {
"records.price": "asc"
},
"size": 1
}
}
}
]
}
}
}
Conditions are: Italia AND max_occupancy > 2.
Inner hits: sort by price ascending order and get the first result.
Hope you'll find it useful

Resources