ElasticSearch _knn_search query on multiple fields

I'm using ES 8.2. I'd like to use the approximate kNN method of _knn_search on more than one vector field. My current code, which searches on a single vector, is attached below. As far as I've read, _knn_search does not support searching on nested fields.
Alternatively, I could use a multi-index search: one index per vector, one search per index, then sum all the results together. However, I need to store all these vectors together in one index, because besides the kNN search on vectors I also need to filter on some other fields.
Thus, the question is: is there a workaround that lets me perform _knn_search on more than one vector?
import numpy as np

search_vector = np.zeros(512).tolist()
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 100,
"num_candidates": 1000
},
"filter": [
{
"range": {
"feature_vector_1.match_prc": {
"gt": 10
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}

The last working query that I ended up with is:
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 1000,
"num_candidates": 1000
},
"filter": [
{
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": """
double value = dotProduct(params.queryVector, 'feature_vector_2.vector');
return 100 * (1 + value) / 2;
""",
"params": {
"queryVector": search_vector
}
},
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}
However, this is not true approximate kNN on two vectors; it is still a workable option if the performance of such a query meets your expectations.

The query below seems to work for me for combining kNN searches by taking the average of multiple cosine similarity scores. Note that this differs a little from the original request, since it performs a brute-force search, but you can still filter the results up front by replacing the match_all part.
GET my-index/_search
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "(cosineSimilarity(params.vector1, 'my-vector1') + cosineSimilarity(params.vector2, 'my-vector2'))/2 + 1.0",
"params": {
"vector1": [
1.3012068271636963,
...
0.23468133807182312
],
"vector2": [
-0.49404603242874146,
...
-0.15835021436214447
]
}
}
}
}
}
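To make the combined score concrete, here is a minimal pure-Python sketch of what the script above computes for a single document, assuming the vectors are plain lists of floats. The function names are hypothetical; only the formula (the mean of the two cosine similarities, shifted by 1.0 so the score stays non-negative) comes from the query.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combined_score(query1, doc1, query2, doc2):
    """Mirror of the script source: mean of two cosine similarities, plus 1.0."""
    mean = (cosine_similarity(query1, doc1) + cosine_similarity(query2, doc2)) / 2
    return mean + 1.0
```

With this formula a document whose two vectors both point in the same direction as the query vectors scores 2.0, and one whose vectors both point in the opposite direction scores 0.0.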

Related

How to sum the size of documents within a time interval?

I'm attempting to estimate the total size of n documents across an index using the query below:
GET /events/_search
{
"query": {
"bool":{
"must": [
{"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
]
}
},
"aggs": {
"total_size": {
"sum": {
"field": "doc['_source'].bytes"
}
}
}
}
This returns documents, but the value of the aggregation is 0:
"aggregations" : {
"total_size" : {
"value" : 0.0
}
}
How can I sum the size of documents within a time interval?

The best way to achieve what you want is to add another field that contains the real source size at indexing time.
However, if you want to run it once to see what it looks like, you can leverage runtime fields to compute this at search time; just know that it can put a heavy burden on your cluster. Since the Painless scripting language doesn't yet provide a way to transform the source document back into the same JSON you sent at indexing time, we can only approximate the value you're looking for by stringifying the _source HashMap, yielding this:
GET /events/_search
{
"runtime_mappings": {
"source.size": {
"type": "double",
"script": """
def size = params._source.toString().length() * 8;
emit(size);
"""
}
},
"query": {
"bool":{
"must": [
{"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
]
}
},
"aggs": {
"size": {
"sum": {
"field": "source.size"
}
}
}
}
Another way is to install the Mapper size plugin so that you can make use of the _size field computed at indexing time.
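The first suggestion, storing the size at indexing time, could look roughly like the sketch below. It assumes documents are Python dicts serialized as UTF-8 JSON before being sent; the helper and the field name source_size_bytes are hypothetical, and the measured size is that of this client-side serialization, which may differ slightly from the bytes Elasticsearch actually stores.

```python
import json


def with_source_size(doc, size_field="source_size_bytes"):
    """Return a copy of doc with the byte length of its compact JSON form added."""
    # Compact separators and raw non-ASCII characters keep the count close
    # to what a typical client would put on the wire.
    payload = json.dumps(doc, separators=(",", ":"), ensure_ascii=False)
    enriched = dict(doc)  # shallow copy so the caller's dict is not modified
    enriched[size_field] = len(payload.encode("utf-8"))
    return enriched
```

A plain sum aggregation on that field would then return the total without any scripting at search time.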

ElasticSearch function score query (range filter)

I want to use document scoring instead of filtering.
As a user I can enter something like buyingPrice (from-to) 50-150€.
This works well with origin,offset,scale - e.g.:
gauss:{
buyingPrice:{
origin:100€
offset:100€
scale:200€
}
}
The problem arises when a user enters only one side, e.g. from 50€.
The expected behavior would be that all buyingPrices above 50€ get the full score, while the ones below 50€ get a lower score.
How can I achieve that with ElasticSearch?

You can add a filter inside the function_score query, so that the function only affects the documents matching that filter:
{
"query": {
"function_score": {
"functions": [
{...}, --> other functions
{
"filter": {
"range": {
"price": {
"lte": 50
}
}
},
"gauss": {
"price": {
"origin": 50,
"offset": 0,
"scale": 200
}
}
}
]
}
}
}
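As a sketch of how the one-sided case could be generated from user input, the hypothetical helpers below build the function shown above for a "from" bound only. Documents at or below the bound match the filter and decay with their distance from it; documents above the bound are not touched by this function. The helper names, the field name, and the scale are assumptions for illustration.

```python
def lower_bound_decay(field, lower, scale):
    """Build a function_score entry that penalizes values below `lower`."""
    return {
        # Only documents at or below the bound are affected by this function.
        "filter": {"range": {field: {"lte": lower}}},
        # The score decays as the value moves away from the bound.
        "gauss": {field: {"origin": lower, "offset": 0, "scale": scale}},
    }


def build_query(functions):
    """Wrap the given functions in a function_score query body."""
    return {"query": {"function_score": {"functions": functions}}}


query = build_query([lower_bound_decay("price", 50, 200)])
```

An "up to" bound could be handled symmetrically by using gte in the filter instead.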

Compute percentile with collapsing by user

Let's say I have an index where I store a million tweets (original objects). I want to get the 90th percentile of users based on their number of followers.
I know there is the "percentiles" aggregation to do this, but my problem is that ElasticSearch uses all documents, so users who tweet a lot add noise to my calculation.
I want to isolate all unique users and then compute the 90th percentile.
The other constraint is that I want to do this in only one or two requests to keep the response time below 500ms.
I have tried a lot of things and was able to do this with "scripted_metric", but when my dataset exceeds 100k tweets the performance degrades critically.
Any advice?
Additional info:
My index stores original tweets & retweets based on user search queries
The index is mapped with a dynamic template mapping (no problem with this)
The index contains approximately 100M
Unfortunately, the "top hits" aggregation doesn't accept sub-aggs.
The request I am trying to achieve is:
{
"collapse": {
"field": "user.id" <--- I want this effect on aggregation
},
"query": {
"bool": {
"must": [
{
"term": {
"metadatas.clientId": {
"value": projectId
}
}
},
{
"match": {
"metadatas.blacklisted": false
}
}
],
"filter": [
{
"range": {
"publishedAt": {
"gte": "now-90d/d"
}
}
}
]
}
},
"aggs":{
"twitter": {
"percentiles": {
"field": "user.followers_count",
"percents": [95]
}
}
},
"size": 0
}

Finally, I figured out a workaround.
In the percentiles aggregation, I can use a script. I use the params variable to hold the keys already seen: the first document of each user returns its follower count, and any later document of that user returns _score instead.
Without a complete explanation of the computation, I cannot fine-tune the behavior of my script. But the result is good enough for me.
"aggs": {
"unique":{
"cardinality": {
"field": "collapse_profile"
}
},
"thresholds":{
"percentiles": {
"field": "user.followers_count",
"percents": [90],
"script": {
"source": """
if(params.keys == null){
params.keys = new HashMap();
}
def key = doc['user.id'].value;
def value = doc['user.followers_count'].value;
if(params.keys[key] == null){
params.keys[key] = _score;
return value;
}
return _score;
""",
"lang": "painless"
}
}
}
}
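For checking the result on a small sample, the intended computation (one value per unique user, then a percentile) can be sketched client-side in plain Python. This is not the Elasticsearch aggregation: it uses a simple nearest-rank percentile, whereas the percentiles aggregation is approximate, and the function name and the tweet layout (a user object with id and followers_count) are assumptions based on the field names above.

```python
import math


def percentile_of_unique_users(tweets, percent):
    """Nearest-rank percentile of followers_count, counting each user once."""
    followers_by_user = {}
    for tweet in tweets:
        user = tweet["user"]
        # Later tweets of the same user simply overwrite the earlier value.
        followers_by_user[user["id"]] = user["followers_count"]

    values = sorted(followers_by_user.values())
    if not values:
        return None
    rank = max(1, math.ceil(percent / 100 * len(values)))
    return values[rank - 1]
```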

Query return the search difference on elasticsearch

How would the following query look?
Scenario:
I have two bases (base 1 and 2), with one column each. I would like to see the difference between them, that is, what exists in base 1 but does not exist in base 2, taking hostname as the fictitious name of the column.
Example:
Is the selected value of Base1.Hostname in Base2.Hostname?
YES → DO NOT RETURN
NO → RETURN
I have this in Python as the following function:
def diff(first, second):
    second = set(second)
    return [item for item in first if item not in second]
Example of an exact match:
GET /base1/_search
{
"query": {
"multi_match": {
"query": "webserver",
"fields": [
"hostname"
],
"type": "phrase"
}
}
}
I would like to migrate this architecture to Elasticsearch in order to generate forecasts in the future based on how frequently these search results change across the bases.

This could be done with an aggregation:
Collect all the hostnames from the base1 & base2 indices
For each hostname, count its occurrences in base2
Keep only the buckets whose base2 count is 0
GET base*/_search
{
"size": 0,
"aggs": {
"all": {
"composite": {
"size": 10,
"sources": [
{
"host": {
"terms": {
"field": "hostname"
}
}
}
]
},
"aggs": {
"base2": {
"filter": {
"match": {
"_index": "base2"
}
}
},
"index_count_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"base2_count": "base2._count"
},
"script": "params.base2_count == 0"
}
}
}
}
}
}
By the way, don't forget to use pagination to get the rest of the results.
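Pagination of a composite aggregation works by passing the after_key of one response as the after parameter of the next request. A minimal sketch, assuming search is any callable that takes a request body and returns the response as a dict (for example a thin wrapper around your client's search method); the function name is hypothetical:

```python
import copy


def iter_composite_buckets(search, body, agg_name):
    """Yield every bucket of a composite aggregation, following after_key."""
    body = copy.deepcopy(body)  # do not modify the caller's request body
    while True:
        response = search(body)
        agg = response["aggregations"][agg_name]
        yield from agg["buckets"]

        after_key = agg.get("after_key")
        if after_key is None:
            # No after_key is taken here to mean there are no more pages.
            break
        body["aggs"][agg_name]["composite"]["after"] = after_key
```

The loop deliberately does not stop on an empty page: with the bucket_selector in place, a page may contain no surviving buckets even though later pages still do.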
References :
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
https://discuss.elastic.co/t/data-set-difference-between-fields-on-different-indexes/160015/4

Filter/aggregate one elasticsearch index of time series data by timestamps found in another index

The Data
So I have reams of different types of time series data. Currently I've chosen to put each type of data into its own index because, with the exception of 4 fields, all of the data is very different. Also, the data is sampled at different rates and is not guaranteed to have common timestamps within the same sub-second window, so fusing it all into one large document is also not a trivial task.
The Goal
One of our common use cases, which I'm trying to see if I can solve entirely in Elasticsearch, is to return an aggregation result from one index based on the time windows returned from a query of another index. Pictorially:
This is what I want to accomplish.
Some Considerations
For small enough signal transitions on the "condition" data, I can just use a date histogram and some combination of a top hits sub-aggregation, but this quickly breaks down when I have 10,000's or 100,000's of occurrences of "the condition". Further, this is just one "case"; I have 100's of sets of similar situations that I'd like to get the overall min/max from.
The comparisons are basically among what I would consider to be sibling-level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution instead of brute-force building the date ranges outside of Elasticsearch with the results of one query and feeding 100's of time ranges into another query.
Looking through the documentation, it feels like some combination of Elasticsearch scripting and some of the pipeline aggregations is going to be what I want, but no definitive solutions are jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.

I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but I'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization, and if I discover such a solution (likely through a scripted aggregation) I'll come back and update this answer.
It may not be the optimal solution but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregators.
The "solution"
def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
    """
    Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.
    index: the Elasticsearch index to search
    terms: a list of dictionaries of form { "term": { "<termname>": <value>}}
    """
    query = {
        "size": 0,
        "timeout": "5s",
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": must_terms,
                        "should": should_terms,
                        "filter": filter_terms
                    }
                }
            }
        },
        "aggs": {
            "by_flight_id": {
                "terms": {"field": "flight_id", "size": 1000},
                "aggs": {
                    "last": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "desc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    "first": {
                        "top_hits": {
                            "sort": [{"#timestamp": {"order": "asc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['#timestamp'].value"
                                }
                            }
                        }
                    },
                    "time_edges": {
                        "histogram": {
                            "min_doc_count": 1,
                            "interval": 1,
                            "script": {
                                "inline": "doc['#timestamp'].value",
                                "lang": "painless",
                            }
                        },
                        "aggs": {
                            "timestamps": {
                                "max": {"field": "#timestamp"}
                            },
                            "timestamp_diff": {
                                "serial_diff": {
                                    "buckets_path": "timestamps",
                                    "lag": 1
                                }
                            },
                            "time_delta_filter": {
                                "bucket_selector": {
                                    "buckets_path": {
                                        "timestampDiff": "timestamp_diff"
                                    },
                                    "script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return es.search(index=index, body=query)
Breaking things down
Filter the results by 'Index 2'
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms holds the required conditions needed to get all the results for "the condition" stored in "Index 2".
For example, to limit results to only the last 10 days and to cases where condition is 10 or 12, we add the following must_terms:
must_terms = [
{
"range": {
"#timestamp": {
"gte": "now-10d",
"lte": "now"
}
}
},
{
"terms": {"condition": [10, 12]}
}
]
This returns a reduced set of documents that we can then pass on into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurrences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
You can get the rising edge of the first occurrence and the falling edge of the last occurrence using the top_hits aggregation:
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on a timestamp. This breaks up your returned results into buckets for every unique timestamp. This is a costly aggregation, but worth it. Using the inline script allows us to use the timestamp value for the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default the histogram aggregation returns a set of buckets with the document count for each bucket, but the serial_diff aggregation needs a metric value to work on, so we have to run a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two buckets are approximately adjacent. We then discard samples that are adjacent to each other and create a combined time range for our condition by using the bucket_selector aggregation. This throws out buckets whose timestamp difference does not exceed our data_sample_accuracy_window. This value is dependent on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for us to determine how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal so the falling edge is unknown without some post-processing. We use the timestampDiff value to figure out where the falling edge is.
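One possible reading of that post-processing step, sketched in plain Python: every surviving bucket marks a rising edge, and subtracting its timestamp_diff value gives the timestamp of the sample before the gap, which is treated here as the falling edge of the previous window. The function name and the exact reconstruction are assumptions, not the author's code; the bucket layout follows the aggregation names used above.

```python
def condition_windows(first_ts, last_ts, edge_buckets):
    """Turn rising-edge buckets into (start, end) windows for one flight.

    first_ts / last_ts: timestamps from the "first" and "last" top_hits.
    edge_buckets: buckets of the "time_edges" histogram that passed the
    bucket_selector, each with a "key" and a "timestamp_diff" value.
    """
    windows = []
    start = first_ts
    for bucket in sorted(edge_buckets, key=lambda b: b["key"]):
        rising = bucket["key"]
        gap = bucket["timestamp_diff"]["value"]
        # The sample just before the gap closes the previous window.
        windows.append((start, rising - gap))
        start = rising
    windows.append((start, last_ts))
    return windows
```

With no surviving buckets the condition is treated as one continuous window from the first to the last sample.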
