How to filter by score in Elastic KNN search? - elasticsearch

I have an index with the following mapping:
{
"test-2": {
"mappings": {
"properties": {
"advert_id": {
"type": "integer"
},
"fraud": {
"type": "boolean"
},
"photos": {
"properties": {
"id": {
"type": "integer"
},
"vector": {
"type": "dense_vector",
"dims": 3,
"index": true,
"similarity": "l2_norm"
}
}
},
"rating": {
"type": "long"
}
}
}
}
}
Here is how my data is saved in Elastic:
{
"advert_id": 123,
"fraud": true,
"photos": [
{
"id": 456,
"vector": [
213.32,
3.23,
4.21
]
}
]
}
I want to search for data with similar vectors using the kNN algorithm. Here is my query for that:
GET /test-2/_knn_search
{
"knn": {
"field": "photos.vector",
"k": 1,
"num_candidates": 5,
"query_vector": [213.32, 3.23, 4.22]
}
}
Elastic returns a score for each hit. The question is: how can I get only the data with a score greater than N? I know about min_score, but couldn't apply it in this query.

Since Elasticsearch 8.4.0, the kNN search API (/_knn_search) has been integrated into the search API (/_search), so we can use the min_score option as per the documentation, as follows:
- GET /test-2/_knn_search
+ GET /test-2/_search
{
"knn": {
"field": "photos.vector",
"k": 1,
"num_candidates": 5,
"query_vector": [213.32, 3.23, 4.22]
},
+ "min_score": N
}
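To choose a sensible N, it can help to recall how the score is derived for the l2_norm similarity used in the mapping above: Elasticsearch documents it as 1 / (1 + d^2), where d is the L2 distance between the query vector and the stored vector. The sketch below converts a maximum acceptable distance into a min_score value and builds the request body for /_search. The helper names and default values are hypothetical and not part of any client library.

```python
def l2_score(query, vector):
    """Score for the l2_norm similarity: 1 / (1 + squared L2 distance)."""
    squared = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared)


def min_score_for_distance(max_distance):
    """Lowest score a hit can have while staying within max_distance."""
    return 1.0 / (1.0 + max_distance ** 2)


def build_knn_body(query_vector, max_distance, k=1, num_candidates=5):
    """Body for GET /test-2/_search (assumes Elasticsearch 8.4 or later)."""
    return {
        "knn": {
            "field": "photos.vector",
            "k": k,
            "num_candidates": num_candidates,
            "query_vector": list(query_vector),
        },
        # Hits whose score falls below this threshold are dropped.
        "min_score": min_score_for_distance(max_distance),
    }


body = build_knn_body([213.32, 3.23, 4.22], max_distance=1.0)
print(body["min_score"])
print(l2_score([213.32, 3.23, 4.22], [213.32, 3.23, 4.21]))
```

With this formula, a score close to 1 means the vectors are nearly identical, and a threshold of 0.5 corresponds to a distance of at most 1.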

Related

Elasticsearch - Query to Determine All Unique IDs that are distance X away from a particular ID?

I have data in this format generated from a random walk (to simulate people walking around). It is set up in this manner: { location : { lat: someLat, lon: someLong }, id: uniqueId, date:date }. I am trying to write a query that, given a user's unique ID, finds how many other unique IDs came within X distance of the given ID within a certain time range. Any hints on how to accomplish this?
My idea is to have a top-level filter aggregation with a nested geo query of some sort. I think the geo-distance query is the way to go, but I am not sure how to include it in the query below to get all of the unique IDs that come within X distance of the ID I am filtering on. The query below is where I am starting from: I am filtering all documents from now - 1 day to now where the document's user ID is the provided value. How would I check all other documents for their distances against documents that match this query?
{
"aggs" : {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyyy",
"ranges": [
{ "to": "now" },
{ "from": "now-1d" }
]
}
},
"locations" : {
"filter" : {
"term": { "id.keyword": "7a50ab18-886b-42a2-80ad-3d45112e3cfd" }
}
}
}
}
Your hunch is correct. All of this can be done using range & geo_distance filtering and _geo_distance sorting. You wanna filter on the query-level, not in the aggs though:
GET walking/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"date": {
"gte": "now-1d"
}
}
}
],
"filter": [
{
"geo_distance": {
"distance": "20m",
"location": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
}
}
}
]
}
},
"aggs": {
"rings_around_loc": {
"geo_distance": {
"field": "location",
"origin": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
},
"unit": "m",
"keyed": true,
"ranges": [
{
"to": 10
},
{
"from": 10,
"to": 50
},
{
"from": 50
}
]
}
},
"locations": {
"value_count": {
"field": "id.keyword"
}
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
},
"order": "asc",
"unit": "m",
"mode": "min",
"distance_type": "arc",
"ignore_unmapped": true
}
}
]
}
Not sure what you need the range buckets for so I left them out.
Full steps to replicate:
PUT walking
{
"mappings": {
"properties": {
"date": {
"type": "date"
},
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"location": {
"type": "geo_point"
}
}
}
}
And then POST _bulk this random walk data
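Since the question asks for unique IDs, a cardinality aggregation (an approximate distinct count) may fit better than value_count, which counts every matching value. Below is a minimal sketch of the request body for a single origin point, assuming the walking index above. The function name, the aggregation name unique_ids, and the default values are illustrative; the given user is excluded with must_not.

```python
def build_nearby_ids_body(user_id, lat, lon, distance="20m", since="now-1d"):
    """Body for GET walking/_search counting distinct IDs near one point."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    # Only documents from the requested time window.
                    {"range": {"date": {"gte": since}}},
                    # Only documents within `distance` of the origin point.
                    {
                        "geo_distance": {
                            "distance": distance,
                            "location": {"lat": lat, "lon": lon},
                        }
                    },
                ],
                # Leave out the user we are measuring against.
                "must_not": [{"term": {"id.keyword": user_id}}],
            }
        },
        "aggs": {
            "unique_ids": {"cardinality": {"field": "id.keyword"}}
        },
    }


body = build_nearby_ids_body(
    "7a50ab18-886b-42a2-80ad-3d45112e3cfd",
    lat=48.20150179951008,
    lon=16.39111876487732,
)
print(body["aggs"])
```

This covers one position only. The given user has many positions over the time range, so one such request per position (or some other way of combining them) would still be needed; that part is not shown here.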

How to filter query based on a field value

I'm working with elasticsearch Query dsl, and I can't find a way to express the following:
Return results that have "price" > min budget and "price" < max budget and has_price=true, and also return all results that have has_price=false.
In other words, I would like to apply the range filter only to results whose has_price field is true; results with has_price set to false should be returned regardless of the filter.
Here's the mapping:
{
"formations": {
"mappings": {
"properties": {
"code": {
"type": "text"
},
"date": {
"type": "date",
"format": "dd/MM/yyyy"
},
"description": {
"type": "text"
},
"has_price": {
"type": "boolean"
},
"place": {
"type": "text"
},
"price": {
"type": "float"
},
"title": {
"type": "text"
}
}
}
}
}
The following query combines the 2 scenarios as 2 should clauses in a bool-query. And as there are only should clauses, minimum_should_match will be 1, meaning that at least one should-clause has to match:
Abstract Code Snippet
GET formations/_search
{
"query": {
"bool": {
"should": [
{ <1st scenario: has_price = false> },
{ <2nd scenario: has_price = true AND price IN budget_range> }
]
}
}
}
Actual Sample Code Snippets
# 1. Create the index and populate it with some sample documents
POST formations/_bulk
{"index": {"_id": 1}}
{"has_price": true, "price": 2.0}
{"index": {"_id": 2}}
{"has_price": true, "price": 3.0}
{"index": {"_id": 3}}
{"has_price": true, "price": 4.0}
{"index": {"_id": 4}}
{"has_price": false, "price": 2.0}
{"index": {"_id": 5}}
{"has_price": false, "price": 3.0}
{"index": {"_id": 6}}
{"has_price": false, "price": 4.0}
# 2. Query assuming min_budget = 2.0 and max_budget = 4.0
GET formations/_search
{
"query": {
"bool": {
"should": [
{
"bool": {
"filter": {
"term": {
"has_price": false
}
}
}
},
{
"bool": {
"filter": [
{
"term": {
"has_price": true
}
},
{
"range": {
"price": {
"gt": 2,
"lt": 4
}
}
}
]
}
}
]
}
}
}
# 3. Result Snippet (4 hits: 3 from 1st scenario & 1 from 2nd scenario)
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
...
Don't forget to add the clause "minimum_should_match": 1 to your bool query in case you add another non-should clause (must or filter) to it; otherwise the should clauses become optional.
Let me know if this answers your question & solves your issue.
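The same logic can be sketched in Python to double-check which sample documents should match. build_budget_query produces the request body for arbitrary budgets, and matches mirrors the two should clauses locally. Both function names are hypothetical, and the local check only models the query logic; it does not talk to Elasticsearch.

```python
def build_budget_query(min_budget, max_budget):
    """Body for GET formations/_search with the two should scenarios."""
    return {
        "query": {
            "bool": {
                "should": [
                    # 1st scenario: no price, always returned.
                    {"bool": {"filter": {"term": {"has_price": False}}}},
                    # 2nd scenario: priced and strictly inside the budget.
                    {
                        "bool": {
                            "filter": [
                                {"term": {"has_price": True}},
                                {"range": {"price": {"gt": min_budget, "lt": max_budget}}},
                            ]
                        }
                    },
                ],
                "minimum_should_match": 1,
            }
        }
    }


def matches(doc, min_budget, max_budget):
    """Local model of the query: True if either scenario applies."""
    if not doc["has_price"]:
        return True
    return min_budget < doc["price"] < max_budget


# The six sample documents from the bulk request above.
DOCS = {
    1: {"has_price": True, "price": 2.0},
    2: {"has_price": True, "price": 3.0},
    3: {"has_price": True, "price": 4.0},
    4: {"has_price": False, "price": 2.0},
    5: {"has_price": False, "price": 3.0},
    6: {"has_price": False, "price": 4.0},
}

matching_ids = sorted(i for i, d in DOCS.items() if matches(d, 2.0, 4.0))
print(matching_ids)  # [2, 4, 5, 6]
```

Documents 1 and 3 drop out because gt and lt are exclusive bounds; gte and lte would include prices equal to the budget limits.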

Elasticsearch: Retrieving filtered and unfiltered count in one request

I am using the following mapping in one of my ElasticSearch indices:
"mappings": {
"my-mapping": {
"properties": {
"id": {
"type": "keyword"
},
"groupId": {
"type" : "keyword"
},
"title": {
"type": "text"
}
}
}
}
I now want to count the elements that match a search string which may be present inside "title", grouped by my groupId. I can achieve that using aggregations and buckets:
/indexname/_search
{
"query" : {
"term" : {
"title" : "sky"
}
},
"aggs": {
"filtered_buckets": {
"terms": {
"field": "groupId"
}
}
}
}
Additionally, I want to know the count of all elements regardless of the filter. I could simply achieve that with a search that has no query:
/indexname/_search
{
"aggs": {
"filtered_buckets": {
"terms": {
"field": "groupId"
}
}
}
}
Current problem is: Is there any possibility to generate aggregation data containing the filtered count and the unfiltered count of only those groups which had a hit before - in one request?
For example:
"buckets": [
{
"key": "257786",
"doc_count": 3024,
"filtered_doc_count" : 202
},
{
"key": "254640",
"doc_count": 3010,
"filtered_doc_count" : 1
},
{
"key": "252256",
"doc_count": 2367,
"filtered_doc_count" : 5
},
...
]
One way I see is splitting the requests in two while first requesting all filtered buckets (their IDs) and then requesting the counts of these specific buckets using "terms" : { "id" : ["4", "65", "404"] }. This is not very nice and I don't want to request twice (_msearch does not help here).
Second bad solution would be to persist the all-counts somewhere in all of my entities.
Is there any way to achieve what I described in a single request?
PS: Please correct me, if the question is unclear.
Based on these:
How to filter terms aggregation
http://nocf-www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
I made this:
PUT test
{
"mappings": {
"my-mapping": {
"properties": {
"id": {
"type": "keyword"
},
"groupId": {
"type" : "keyword"
},
"title": {
"type": "text"
}
}
}
}
}
PUT test/type1/1
{
"id":1,
"groupId": 1,
"title": "asd"
}
PUT test/type1/2
{
"id":2,
"groupId": 1,
"title": "sky"
}
PUT test/type1/3
{
"id":3,
"groupId": 2,
"title": "sky"
}
PUT test/type1/4
{
"id":4,
"groupId": 2,
"title": "sky"
}
PUT test/type1/5
{
"id":5,
"groupId": 2,
"title": "sky"
}
POST test/type1/_search
{
"aggs": {
"categories-filtered": {
"filter": {"term": {"title": "sky"}},
"aggs": {
"names": {
"terms": {"field": "groupId"}
}
}
},
"categories": {
"terms": {"field": "groupId"}
}
}
}
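The response to this request contains the two aggregations side by side, so the bucket shape asked for in the question still has to be assembled on the client. A minimal sketch of that merge follows; the function name is hypothetical, and the sample dictionary is hand-built from the five documents indexed above rather than copied from a real response.

```python
def merge_counts(aggregations):
    """Combine filtered and unfiltered terms buckets per groupId."""
    totals = {
        bucket["key"]: bucket["doc_count"]
        for bucket in aggregations["categories"]["buckets"]
    }
    merged = []
    # Only groups with at least one filtered hit appear in the output.
    for bucket in aggregations["categories-filtered"]["names"]["buckets"]:
        merged.append(
            {
                "key": bucket["key"],
                # None if the group fell outside the unfiltered top buckets.
                "doc_count": totals.get(bucket["key"]),
                "filtered_doc_count": bucket["doc_count"],
            }
        )
    return merged


# Hand-built "aggregations" section matching the sample documents.
sample = {
    "categories-filtered": {
        "doc_count": 4,
        "names": {
            "buckets": [
                {"key": "2", "doc_count": 3},
                {"key": "1", "doc_count": 1},
            ]
        },
    },
    "categories": {
        "buckets": [
            {"key": "2", "doc_count": 3},
            {"key": "1", "doc_count": 2},
        ]
    },
}

merged = merge_counts(sample)
print(merged)
```

Note that a terms aggregation returns only the top 10 buckets by default, so with many groups the unfiltered aggregation may need a larger size for every filtered group to find its total.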

Elasticsearch: Why can't I use "5m" for precision in context queries?

I'm running on Elasticsearch 5.5
I have a document with the following mapping
"mappings": {
"shops": {
"properties": {
"locations": {
"type": "geo_point"
},
"name": {
"type": "keyword"
},
"suggest": {
"type": "completion",
"contexts": [
{
"name": "location",
"type": "GEO",
"precision": "10m",
"path": "locations"
}
]
}
}
}
}
I'll add a document as follows:
PUT my_index/shops
{
"name":"random shop",
"suggest":{
"input":"random shop"
},
"locations":[
{
"lat":42.38471212,
"lon":-71.12612357
}
]
}
I try to query for the document with the following JSON call:
GET my_shops/_search
{
"suggest": {
"result": {
"prefix": "random",
"completion": {
"field": "suggest",
"size": 5,
"fuzzy": true,
"contexts": {
"location": [{
"lat": 42.38471212,
"lon": -71.12612357,
"precision": "10mi"
}]
}
}
}
}
}
I get the following errors:
[screenshot of the error response; source: discourse.org]
But when I change the "precision" field to an int, I get the intended search results.
I'm confused on two fronts.
Why is there a context error? The documentation seems to say that this is ok
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/suggester-context.html
Why can't I use string values for the precision values?
At the bottom of the page, I see that the precision values can take either distances or numeric values.

Elasticsearch unexpected results when sorting against deeply nested attributes

I'm trying to perform some sorting based on the attributes of a document's deeply nested children.
Let's say we have an index filled with publisher documents. A publisher has a collection of books, and
each book has a title, a published flag, and a collection of genre scores. A genre_score represents how well
a particular book matches a particular genre, or in this case a genre_id.
First, let's define some mappings (for simplicity, we will only be explicit about the nested types):
curl -XPUT 'localhost:9200/book_index' -d '
{
"mappings": {
"publisher": {
"properties": {
"books": {
"type": "nested",
"properties": {
"genre_scores": {
"type": "nested"
}
}
}
}
}
}
}'
Here are our two publishers:
curl -XPUT 'localhost:9200/book_index/publisher/1' -d '
{
"name": "Best Books Publishing",
"books": [
{
"name": "Published with medium genre_id of 1",
"published": true,
"genre_scores": [
{ "genre_id": 1, "score": 50 },
{ "genre_id": 2, "score": 15 }
]
}
]
}'
curl -XPUT 'localhost:9200/book_index/publisher/2' -d '
{
"name": "Puffin Publishers",
"books": [
{
"name": "Published book with low genre_id of 1",
"published": true,
"genre_scores": [
{ "genre_id": 1, "score": 10 },
{ "genre_id": 4, "score": 10 }
]
},
{
"name": "Unpublished book with high genre_id of 1",
"published": false,
"genre_scores": [
{ "genre_id": 1, "score": 100 },
{ "genre_id": 2, "score": 35 }
]
}
]
}'
And here is the final definition of our index & mappings...
curl -XGET 'localhost:9200/book_index/_mappings?pretty=true'
...
{
"book_index": {
"mappings": {
"publisher": {
"properties": {
"books": {
"type": "nested",
"properties": {
"genre_scores": {
"type": "nested",
"properties": {
"genre_id": {
"type": "long"
},
"score": {
"type": "long"
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"published": {
"type": "boolean"
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Now suppose we want to query for a list of publishers and have them sorted by how well their books perform
in a particular genre. In other words, sort the publishers by the genre_scores.score of one of their books
for the target genre_id.
We might write a search query like this...
curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
{
"size": 5,
"from": 0,
"sort": [
{
"books.genre_scores.score": {
"order": "desc",
"nested_path": "books.genre_scores",
"nested_filter": {
"term": {
"books.genre_scores.genre_id": 1
}
}
}
}
],
"_source":false,
"query": {
"nested": {
"path": "books",
"query": {
"bool": {
"must": []
}
},
"inner_hits": {
"size": 5,
"sort": []
}
}
}
}'
Which correctly returns the Puffin (with a sort value of [100]) first and Best Books second (with a sort value of [50]).
But suppose we only want to consider books for which published is true. This would change our expectation to have Best Books first (with a sort of [50]) and Puffin second (with a sort of [10]).
Let's update our nested_filter and query to the following...
curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
{
"size": 5,
"from": 0,
"sort": [
{
"books.genre_scores.score": {
"order": "desc",
"nested_path": "books.genre_scores",
"nested_filter": {
"bool": {
"must": [
{
"term": {
"books.genre_scores.genre_id": 1
}
}, {
"term": {
"books.published": true
}
}
]
}
}
}
}
],
"_source": false,
"query": {
"nested": {
"path": "books",
"query": {
"term": {
"books.published": true
}
},
"inner_hits": {
"size": 5,
"sort": []
}
}
}
}'
Suddenly, our sort values for both publishers have become [-9223372036854775808].
Why does adding an additional term to our nested_filter in the top-level sort have this impact?
Can anyone provide some insight as to why this behavior is happening? And additionally, if there are any viable solutions to the proposed query/sort?
This occurs in both ES1.x and ES5
Thanks!
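The expected ordering can be checked outside Elasticsearch with a small Python model of the two publishers. It takes the highest score for the target genre_id per publisher, optionally restricted to published books, and falls back to -2**63 when nothing matches. That fallback equals the -9223372036854775808 reported above, which is the smallest 64-bit signed integer; this suggests (as an assumption, not a confirmed diagnosis) that no genre_scores entry passed the sort filter, possibly because books.published lives on the parent books level rather than on books.genre_scores. All names in the sketch are illustrative.

```python
LONG_MIN = -2**63  # -9223372036854775808, the sort value reported above

PUBLISHERS = [
    {
        "name": "Best Books Publishing",
        "books": [
            {"published": True, "genre_scores": [
                {"genre_id": 1, "score": 50}, {"genre_id": 2, "score": 15}]},
        ],
    },
    {
        "name": "Puffin Publishers",
        "books": [
            {"published": True, "genre_scores": [
                {"genre_id": 1, "score": 10}, {"genre_id": 4, "score": 10}]},
            {"published": False, "genre_scores": [
                {"genre_id": 1, "score": 100}, {"genre_id": 2, "score": 35}]},
        ],
    },
]


def sort_value(publisher, genre_id, published_only):
    """Highest score for genre_id, or LONG_MIN if no entry qualifies."""
    scores = [
        gs["score"]
        for book in publisher["books"]
        if book["published"] or not published_only
        for gs in book["genre_scores"]
        if gs["genre_id"] == genre_id
    ]
    return max(scores, default=LONG_MIN)


def ranked(genre_id, published_only):
    """Publishers with their sort values, highest first."""
    rows = [(p["name"], sort_value(p, genre_id, published_only)) for p in PUBLISHERS]
    return sorted(rows, key=lambda row: row[1], reverse=True)


print(ranked(1, published_only=False))
print(ranked(1, published_only=True))
```

The first call reproduces the order from the first query (Puffin at 100, Best Books at 50); the second gives the order the question expects once unpublished books are ignored (Best Books at 50, Puffin at 10).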
