How can I perform KNN search on multiple queries efficiently using Elasticsearch's msearch API? I have a large number of queries that I need to run and I'd like to avoid looping over each query individually.
Each query is a vector, so KNN search is a must. However, I can't find any examples or documentation on how to combine msearch with KNN search in Elasticsearch. Can someone provide an example or point me to relevant documentation on this topic?
I followed the approximate kNN example and put two queries in a multi search:
GET image-index/_msearch
{}
{"knn": {"field": "image-vector", "query_vector": [-5, 9, -12], "k": 10, "num_candidates": 100}, "fields": ["title", "file-type"]}
{"index": "new-index"}
{"query": {"match_all": {}}}
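For reference, a sketch of what running two kNN searches in one _msearch call could look like (the index, field, and vectors are taken from the example above; a top-level knn section in the search body requires a recent Elasticsearch 8.x release):

GET image-index/_msearch
{}
{"knn": {"field": "image-vector", "query_vector": [-5, 9, -12], "k": 10, "num_candidates": 100}, "fields": ["title"]}
{}
{"knn": {"field": "image-vector", "query_vector": [1, 0, -2], "k": 10, "num_candidates": 100}, "fields": ["title"]}

Each pair of lines is an independent search, so the responses come back in the same order in the "responses" array.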
I understand that there are term vectors in Elasticsearch, which can give word positions and other stats.
Can percolator give the word position in the documents that are being searched on?
I understand that the documents are not indexed and only percolator queries are indexed. I see the below
If the requested information wasn’t stored in the index, it will be computed on the fly if possible. Additionally, term vectors could be computed for documents not even existing in the index, but instead provided by the user.
in - https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
So I am interested to know: can Elasticsearch calculate word positions on the fly?
Any leads are appreciated. Thanks for reading.
@Kaveh
Thanks for taking the time, but I'm sorry, I don't see how this (https://stackoverflow.com/a/67926555/4068218) is related, because using artificial documents I can get the stats - https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-termvectors.html
But what I have is a percolator - https://www.youtube.com/watch?v=G2Ru2KV0DZg
So even if I get the term vectors on the fly using artificial documents or /_analyze, it does not matter, as they will not give me the positions of the terms in the percolator results.
E.g. with the percolator I am trying to find the word "Hello".
My document has the following field and value:
"text": "Hello World"
If I use artificial documents or /_analyze, they will tell me 0 - Hello, 1 - World; but when I percolate, I will get the
percolate query that found the word "Hello". I want to combine both and have the percolator tell me
"I found Hello in position 0".
As you can see in the term vectors documentation, if you store _source, Elasticsearch can calculate the term vectors on the fly. It will analyze your text based on the source and aggregate it with the existing term vectors of the index.
If you want the result for specific terms, you can always restrict the analyzed data to a list of terms; for more information see here.
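As a sketch, an on-the-fly term vectors request for an artificial document (the index name is an assumption; the field matches the example above) would look like this:

GET my-index/_termvectors
{
  "doc": {
    "text": "Hello World"
  },
  "positions": true,
  "offsets": true
}

The response lists each term with its position (Hello at 0, World at 1), which is the information the percolator alone does not return.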
I have an elasticsearch query that includes bool - must / should sections that I have refined to match search terms and boost for terms in priority fields, phrase match, etc.
I would like to boost documents that are the most popular. The documents include a field "popularity" that indicates the number of times the document was viewed.
Preferably, I would like to boost any documents in the result set that are outliers - meaning that the popularity score is perhaps 2 standard deviations from the average in the result set.
I see aggregations but I'm interested in boosting results in a query, not a report/dashboard.
I also noted the new rank_feature query in ES 7 (I am still on 6.8 but could upgrade). It looks like the rank_feature query looks across all documents, not the result set.
Is there a way to do this?
I think you want to use a rank or a range query in a rescore query.
If your need is too specific for classical queries, you can use a function_score query in your rescore and use a script to write your own score calculation:
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/filter-search-results.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-rescore.html
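A minimal sketch of such a rescore (the popularity field name comes from the question; the index name, match clause, and boost formula are only illustrations):

POST my-index/_search
{
  "query": { "match": { "title": "search terms" } },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "Math.log(2 + doc['popularity'].value)"
            }
          }
        }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 1.5
    }
  }
}

The rescore only re-ranks the top window_size hits per shard, so it operates on the result set rather than the whole index, which is what the question asks for.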
I was interested in fetching documents similar to a given input document (similar to kNN). Vectorizing documents of very different sizes (using doc2vec) results in inconsistent document vectors. Also, computing a vector for the user's input (which may be just a few terms or sentences, compared to the docs the doc2vec model was trained on, each consisting of hundreds or thousands of words) and then finding the k nearest neighbours would produce incorrect results due to the lack of features.
Hence, I went ahead with the more_like_this query, which does a similar job to kNN regardless of the size of the user's input, since I'm interested in analyzing only text fields.
But I was concerned about the performance when I have millions of documents indexed in elasticsearch. The documentation says that using term_vector to store the term vectors at the index time can speed up the analysis.
But what I don't understand is which type of term vector the documentation refers to in this context. As there are three different types of term vectors: term information, term statistics, and field statistics.
And since term statistics and field statistics compute term frequencies relative to other documents in the index, wouldn't these vectors become outdated when I introduce new documents into the index?
Hence I presume that the more_like_this documentation refers to the term information (which is the information of the terms in one particular document irrespective of the others).
Can anyone let me know if computing only the term information vector at the index time is sufficient to speed up more_like_this?
There shouldn't be any worries about term vectors being outdated, since they are stored per document, so they are updated accordingly.
For More Like This it is enough to have term_vector: yes; you don't need offsets and positions. So, if you don't plan on using highlighting, you should be fine with just the plain yes option.
So, for your text field, you would need to have mappings like this and it will be enough to speed up MLT execution:
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "yes"
      }
    }
  }
}
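For completeness, a sketch of a more_like_this query against that field (the index name, document id, and frequency thresholds are placeholders):

GET my-index/_search
{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like": [
        { "_index": "my-index", "_id": "1" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

With term_vector: "yes" in the mapping, the interesting terms for the "like" document are read from the stored term vectors instead of being re-analyzed from _source, which is where the speed-up comes from.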
I am using Elasticsearch 5+, and I did some queries using fuzzy.
I understand the following fuzzy parameters:
fuzziness, prefix_length.
But I cannot understand max_expansions. I have read many articles, but it is hard for me because there are few examples of it.
Can you explain this parameter to me using examples? How does it work together with the fuzziness parameter?
To give an example, I ran this query:
GET my-index/my-type/_search
{
  "query": {
    "fuzzy": {
      "my-field": {
        "value": "house",
        "fuzziness": 1,
        "prefix_length": 0,
        "max_expansions": 1
      }
    }
  }
}
I have 4 shards, and my query found 6 results, because there are 6 documents with "hous" in "my-field".
If max_expansions worked like a limit in a database, shouldn't the maximum be 4 results (because I have 4 shards)? Why does it return 6 results?
A quote from Elasticsearch blog post:
The max_expansions setting, which defines the maximum number of terms the fuzzy query will match before halting the search, can also have dramatic effects on the performance of a fuzzy query. Cutting down the query terms has a negative effect, however, in that some valid results may not be found due to early termination of the query. It is important to understand that the max_expansions query limit works at the shard level, meaning that even if set to 1, multiple terms may match, all coming from different shards. This behavior can make it seem as if max_expansions is not in effect, so beware that counting unique terms that are returned is not a valid way to determine if max_expansions is working.
Basically it means that, under the hood, when Elasticsearch triggers a fuzzy query it limits the number of terms considered in the search to max_expansions. As written above, this is not as obvious as, for example, LIMIT in databases, because in Elasticsearch it works per shard. You will probably get more expected results by setting up Elasticsearch locally with only one shard and testing the behavior there.
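As a sketch, one way to test this (the index name is an assumption): create a single-shard index, index the same documents, and rerun the fuzzy query from the question:

PUT test-fuzzy
{
  "settings": {
    "number_of_shards": 1
  }
}

With only one shard, max_expansions: 1 means at most one unique term is expanded for the whole index, so the per-shard nature of the limit no longer obscures the result count.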
I have a large database of image annotations stored in Elasticsearch, and I want to use it for keyword extraction. The input is text (typically a newspaper article). My basic idea for an algorithm is to go through each term in the article and use Elasticsearch to discover how frequent the term is in the image annotations, then output the terms from the article that are not frequent (in order to prefer names of people or places over common English words).
I don't need something very sophisticated; these keywords are used only as suggestions for user input. But I want something faster than issuing N search queries (where N is the number of terms in the text) to Elasticsearch, which can be slow for large texts. Is there a robust and fast technique for keyword extraction in Elasticsearch?
You can use Elasticsearch terms aggregations for this. They return bucketed keywords with document counts that indicate their relative frequency. Here is an example query in YAML:
query:
  match:
    annotation:
      query: text of your article
aggregations:
  term_frequencies:
    terms:
      field: annotation
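The same request in the more common JSON form (the index name is an assumption; note that a terms aggregation on an analyzed text field requires fielddata to be enabled on it, or a keyword sub-field such as annotation.keyword):

GET annotations/_search
{
  "query": {
    "match": { "annotation": "text of your article" }
  },
  "aggregations": {
    "term_frequencies": {
      "terms": { "field": "annotation" }
    }
  }
}

This answers the frequency question for all terms of the article in a single round trip instead of N separate queries.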