What should be the value of max_gram and min_gram in Elastic search - elasticsearch

I have a question regarding ngram configuration. The Elasticsearch documentation says:
It usually makes sense to set min_gram and max_gram to the same value.
Presumably, too large a difference between min_gram and max_gram will increase index storage.
But many blogs use a max_gram of 8 or even 20 to get more accurate results.
I am confused between the two approaches. Which one should I use?
What are the pros and cons of each?
Note: my use case is indexing articles. Article content is usually around 150 KB.
Thanks

Analyze your search queries. Find out what kinds of like/wildcard queries come in most frequently, what the maximum and minimum lengths of the search phrases are, and whether matching is case sensitive. Which fields are searched, and which contain similar data? If the data is similar, it will not take much more storage.
You need to analyze your data and the relationships within it, and understand your query behavior. Once you have all this information, you can make a better decision, or find a better way to solve the problem.
This article can help you: https://medium.com/@ashishstiwari/what-should-be-the-value-of-max-gram-and-min-gram-in-elasticsearch-f091404c9a14
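To make the storage trade-off concrete, here is a sketch of index settings with min_gram equal to max_gram, plus a small helper that counts how many tokens a word produces for a given gram range. The tokenizer, analyzer, and field names ("trigrams", "trigram_analyzer", "content") are placeholders invented for this example.

```python
# Hypothetical index settings: equal min_gram/max_gram, as the docs suggest.
trigram_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigrams": {
                    "type": "ngram",
                    "min_gram": 3,  # equal min/max keeps the token count bounded
                    "max_gram": 3,
                }
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigrams",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"content": {"type": "text", "analyzer": "trigram_analyzer"}}
    },
}


def ngram_count(text: str, min_gram: int, max_gram: int) -> int:
    """Number of n-gram tokens a single word produces (ignoring token filters)."""
    n = len(text)
    return sum(max(0, n - g + 1) for g in range(min_gram, max_gram + 1))


# Widening the min/max spread multiplies the token count, and therefore index size:
print(ngram_count("elasticsearch", 3, 3))  # 11 tokens
print(ngram_count("elasticsearch", 3, 8))  # 51 tokens
```

For a 150 KB article body, that multiplication applies to every word, which is why a wide gram range inflates the index so quickly.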

Related

Does elastic search use previous search frequencies?

Does Elasticsearch use the frequency with which a document was previously returned? For example, there are documents A and B. Both have similar scores in terms of edit distance and other metrics, but document A is searched for very frequently and B is not. Will Elasticsearch score A higher than B? If not, how can I achieve this?
Elasticsearch does not change scores based on previous searches in its default scoring algorithm. In fact, this is really a question about Lucene scoring, since Elasticsearch uses Lucene for all of the actual search logic.
I think you may be looking at this from the wrong viewpoint. Users search with a query, and Elasticsearch recommends documents. You have no way of knowing if the document it recommended was valid or not just based on the search. I think your question should really be, "How can I tune Search relevance in an intelligent way based on user data?".
Now, there are a number of ways you can achieve this, but they require you to gather user data and build the model yourself. So unfortunately, there is no easy way.
However, I would recommend taking a look at https://www.elastic.co/app-search/, which offers a managed solution with lots of custom relevance tuning and may save you a lot of time depending on your use case.
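If you do gather the click data yourself, one common do-it-yourself pattern is to keep a per-document counter and fold it into the score with a function_score query. The field name "popularity" and the query below are a sketch under that assumption, not a built-in feature:

```python
# Sketch: boost by a hypothetical "popularity" counter that your application
# increments each time a user clicks a result. Field name is made up.
def popularity_boosted_query(user_query: str) -> dict:
    return {
        "query": {
            "function_score": {
                "query": {"match": {"content": user_query}},
                "field_value_factor": {
                    "field": "popularity",
                    "modifier": "log1p",  # dampen very large click counts
                    "missing": 0,         # never-clicked docs contribute 0
                },
                # "sum" adds the factor to the relevance score, so unseen
                # documents keep their base relevance instead of scoring 0.
                "boost_mode": "sum",
            }
        }
    }


body = popularity_boosted_query("elasticsearch scoring")
```

The log1p modifier keeps one runaway-popular document from drowning out textual relevance entirely.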

Hold Elasticsearch document frequency constant as index changes

I'm using Elasticsearch to retrieve XML documents by terms. I have multiple indexes, one for each day. I have a large collection of documents that is, in some sense, representative. The document frequency of several terms varies from day to day.
The matching I'm doing depends on the inverse document frequency of terms. I'd like to use not the IDF of the indices I'm searching, but the IDF computed from the large, representative set. Is there a straightforward way to do this without writing custom scoring functions for large, complex queries?
There is no other way.
FWIW, to access and use IDF you need to write a custom ScriptEngine in Elasticsearch, and then probably use scripts from that engine for sorting.

Elasticsearch: Is there a way to turn off scoring, to gain performance?

I have over 33M records in my Elasticsearch 7.1 index, and when I query it I limit the result size to 20. However, ES still scores the records internally, and this isn't important to me; in fact, I'll take any 20 results. For example, I don't care if some of the results are more relevant than others.
My question is, is there a way to turn this behaviour off, and if so, will it improve the performance?
You can use _doc as a sort field. This will make ES return documents in index (insertion) order, so it will skip scoring.
Here is a thread from the forums that explains more:
https://discuss.elastic.co/t/most-efficient-way-to-query-without-a-score/57457/4
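A minimal sketch of such a request body, assuming a placeholder field name ("status"); putting the condition in a bool filter additionally makes it cacheable and score-free, and track_total_hits (available in 7.x) skips counting all 33M matches:

```python
# Fetch "any 20" matching documents without relevance scoring.
unscored_query = {
    "size": 20,
    "sort": ["_doc"],           # index order: no scores computed
    "track_total_hits": False,  # also skip counting every match
    "query": {
        # Filter context: the condition is matched but never scored.
        "bool": {"filter": [{"term": {"status": "published"}}]}
    },
}
```

Sorting by _doc is also what the scroll API recommends for the same reason: it is the cheapest possible ordering.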

Using ElasticSearch to find trends

I have implemented Elasticsearch for my search and have a good feeling it could easily be leveraged for finding trends, but it's on the tip of my tongue how one would start to go about such a thing. Can anyone point me in the right direction, or give me some keywords to look into that might make this possible?
Depending on what exactly you want to do, it might be useful to have a look at aggregations in Elasticsearch.
If you combine, for example, the significant terms aggregation with a query for a given time frame, you will get back the terms that are common in that time frame but rather unusual in the rest of the dataset (basically, the trends in your dataset).
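A sketch of that combination, with placeholder field names ("tags", "published_at"); note that significant_terms needs a keyword field (or a text field with fielddata enabled):

```python
# Sketch: terms unusually frequent in the last 7 days relative to the
# whole index, i.e. "trending" terms. Field names are placeholders.
trending_terms = {
    "size": 0,  # we only want the aggregation, not the hits themselves
    "query": {"range": {"published_at": {"gte": "now-7d/d"}}},
    "aggs": {
        "trends": {
            # Foreground set: docs matching the query (last 7 days).
            # Background set: the whole index. Significant terms are the
            # ones over-represented in the foreground vs the background.
            "significant_terms": {"field": "tags", "size": 10}
        }
    },
}
```

Each bucket comes back with a significance score, so the top buckets are your strongest trend candidates rather than merely the most frequent terms.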

Lucene - limit amount of results for specific term in search query

The problem is that one of our terms can be very common (for example, the number "3"). In that case I would like to limit the number of search results scored while Lucene is running the query. Is that even possible?
Just to emphasize: I don't want merely to limit the number of Lucene search results (that can easily be done using the second parameter of the IndexSearcher.Search method). I want to tell Lucene something like: don't spend too much time finding hits for that specific term; once you have found, say, 1,000,000, stop looking and move on to the other terms.
No, you can't. As you might know, absolute scores are meaningless in Lucene, so there's no support for them.
Because the term is really common, its document frequency is high and its idf correspondingly low, so it will probably be relatively inconsequential due to Lucene's pruning algorithms. You can always lower the boost to make it matter even less, but I'd double-check that this is really your performance bottleneck.
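Expressed in Elasticsearch's query DSL for illustration (Lucene's Java API has the analogous BoostQuery wrapper), down-weighting the common term might look like the following sketch; the field name "body" and the boost value are assumptions:

```python
# Sketch: give the very common term a low query-time boost so it contributes
# little to the score, while the remaining terms dominate ranking.
deboosted_query = {
    "query": {
        "bool": {
            "should": [
                # Common term: still matches, but carries little weight.
                {"term": {"body": {"value": "3", "boost": 0.1}}},
                # The rest of the query at normal weight.
                {"match": {"body": "rare interesting terms"}},
            ]
        }
    }
}
```

This does not stop Lucene from visiting the term's postings, but it does reduce the term's influence on ranking, which is usually the part that matters.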
