How can I get fuzzy results sorted by relevance using ElasticSearch?

Basically, if I search for something like
neckla
I want to get results for necklace. When I did a fuzzy search, I got results for food and other unrelated terms (basically, it wasn't filtering properly), although the necklace result was there.
Any advice?

Have you adjusted the min_similarity parameter in your fuzzy query? Increasing the min_similarity may filter out the food and only return the necklace.
{
  "fuzzy" : {
    "jewelry" : {
      "value" : "neckla",
      "boost" : 1.0,
      "min_similarity" : 0.8,
      "prefix_length" : 0
    }
  }
}
As mentioned in the answer to this question, an edgeNGram filter may be your best bet here.
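To see why an edge n-gram filter helps with a prefix such as neckla, here is a minimal Python sketch of what such a filter emits at index time. The function name `edge_ngrams` and the `min_gram`/`max_gram` values are illustrative assumptions, not settings taken from the original answer.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of `token` with lengths from min_gram to max_gram.

    This imitates, in plain Python, the terms an edge n-gram token filter
    would produce for a single token. The parameter values are assumptions.
    """
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# Terms that would be indexed for the document word "necklace".
indexed_terms = edge_ngrams("necklace", min_gram=3, max_gram=10)
print(indexed_terms)

# The partial input "neckla" is one of those terms, so an exact term
# lookup finds the document without any fuzzy matching.
print("neckla" in indexed_terms)
```

Because the partial word is matched exactly against an indexed prefix, unrelated words such as food never enter the result set, which is the problem a loose fuzzy threshold causes.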

Related

Is there a difference between using search terms and should when querying Elasticsearch

I am refactoring the code that queries an ES index, and I was wondering whether there is any difference between the two snippets below:
"bool" : {
  "should" : [ {
    "terms" : {
      "myType" : [ 1 ]
    }
  }, {
    "terms" : {
      "myType" : [ 2 ]
    }
  }, {
    "terms" : {
      "myType" : [ 4 ]
    }
  } ]
}
and
"terms" : {
  "myType" : [ 1, 2, 4 ]
}
Please check this post from the Elastic discuss forum, which answers your question. Copying it here for quick reference:
There are a few differences.
The simplest to see is the verbosity: terms queries just list an array, while term queries require more JSON.
terms queries do not score matches based on the IDF (the rareness) of the matched terms; the term query does.
term queries combined in a bool query can only have up to 1024 clauses by default, because of the Boolean query's max clause count. terms queries can have more terms.
By default, Elasticsearch limits the terms query to a maximum of 65,536 terms. You can change this limit using the index.max_terms_count setting.
Which of them is going to be faster? Is speed also related to the number of terms?
It depends. They execute differently. term queries do more expensive scoring but do so lazily. They may "skip" over docs during execution because other, more selective criteria may advance the stream of matching docs considered.
The terms query doesn't do expensive scoring but is more eager: it creates the equivalent of a single bitset, with a one or zero for every doc, by ORing all the potentially matching docs up front. Many terms can share the same bitset, which is what provides the scalability in the number of terms.
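The two query shapes being compared can be sketched in Python as plain dictionaries. The helper names (`terms_query`, `bool_should_of_terms`, `matches`) are hypothetical, and the tiny in-memory matcher is only an illustration of which documents are selected. It says nothing about scoring or about how Elasticsearch executes either query.

```python
def terms_query(field, values):
    """Build the compact form: one terms query listing every value."""
    return {"terms": {field: list(values)}}


def bool_should_of_terms(field, values):
    """Build the verbose form: one term clause per value inside bool/should."""
    return {"bool": {"should": [{"term": {field: v}} for v in values]}}


def matches(query, doc):
    """Toy matcher for the shapes built above only (not real ES behaviour)."""
    if "terms" in query:
        (field, values), = query["terms"].items()
        return doc.get(field) in values
    if "term" in query:
        (field, value), = query["term"].items()
        return doc.get(field) == value
    if "bool" in query:
        # A bool query with only should clauses needs at least one to match.
        return any(matches(clause, doc) for clause in query["bool"]["should"])
    raise ValueError("unsupported query shape")


docs = [{"myType": n} for n in range(1, 6)]
compact = terms_query("myType", [1, 2, 4])
verbose = bool_should_of_terms("myType", [1, 2, 4])

hits_compact = [d["myType"] for d in docs if matches(compact, d)]
hits_verbose = [d["myType"] for d in docs if matches(verbose, d)]
print(hits_compact, hits_verbose)

# The verbose form grows by one clause per value, which is what runs into
# the Boolean max clause count mentioned above.
print(len(verbose["bool"]["should"]))
```

Both forms select the same documents. The differences described above concern scoring, clause limits, and execution strategy rather than the set of hits.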

What is the difference between `constant_score + filter` and `term` query?

I have two queries in Elasticsearch:
{
  "term" : {
    "price" : 20
  }
}
and
"constant_score" : {
  "filter" : {
    "term" : {
      "price" : 20
    }
  }
}
They return the same query result. I wonder what the main difference between them is. I read some articles about document scoring, and I believe both queries score documents. The constant_score query will use the default score of 1.0 as the document's score. So I don't see much difference between these two.
The results would be exactly the same.
However, the biggest difference is that the constant_score/filter version will cache the results of the term query, since it's run in a filter context. All future executions will leverage that cache. Also, one feature of the constant_score query is that the returned score is always equal to the given boost value (which defaults to 1).
The first query will be run outside of the filter context and hence will not benefit from the filter cache.
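As a small illustration of the boost behaviour described above, the following Python sketch wraps any filter in a constant_score query. The helper name `constant_score` and the boost value 2.5 are assumptions made for the example.

```python
def constant_score(filter_query, boost=1.0):
    """Wrap `filter_query` so that it runs in filter context.

    Every matching document then gets a score equal to `boost`
    instead of a relevance score computed by the inner query.
    """
    return {"constant_score": {"filter": filter_query, "boost": boost}}


price_filter = {"term": {"price": 20}}

# Default boost: every hit is expected to come back with a score of 1.0.
default_body = {"query": constant_score(price_filter)}

# Explicit boost: every hit is expected to come back with a score of 2.5.
boosted_body = {"query": constant_score(price_filter, boost=2.5)}

print(default_body)
print(boosted_body)
```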

How to boost individual words in a elasticsearch match query

Suppose I want to query "Best holiday places to visit during summer" in an Elasticsearch cluster, but I want holiday, visit and summer to have higher priority than the other words.
Something like this: Best holiday^4 places to visit^3 during summer^2.
I know about field boosting, but what I want to do is not achievable with a field boost.
Basically, I want to boost individual words.
Does anyone have any idea how to do this in Elasticsearch 5.6 and above?
You could use query_string to boost individual terms like this:
{
  "query" : {
    "query_string" : {
      "fields" : ["content", "name"],
      "query" : "Best holiday^4 places to visit^3 during summer^2"
    }
  }
}
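If the boosts come from application code rather than being typed by hand, the query text can be assembled programmatically. This is a minimal Python sketch. The helper name `boosted_query_string` is hypothetical, and it assumes the words contain no characters that are special in query_string syntax.

```python
def boosted_query_string(words):
    """Join (word, boost) pairs into query_string syntax.

    A boost of None leaves the word unboosted. Words are assumed to be
    free of reserved query_string characters (no escaping is done here).
    """
    parts = []
    for word, boost in words:
        parts.append(word if boost is None else f"{word}^{boost}")
    return " ".join(parts)


words = [
    ("Best", None), ("holiday", 4), ("places", None), ("to", None),
    ("visit", 3), ("during", None), ("summer", 2),
]

body = {
    "query": {
        "query_string": {
            "fields": ["content", "name"],
            "query": boosted_query_string(words),
        }
    }
}
print(body["query"]["query_string"]["query"])
```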

How to normalize ElasticSearch scores?

For my project I need to find out which search results are considered "good" matches. Currently, the scores vary wildly depending on the query, hence the need to normalize them somehow. Normalizing the scores would make it possible to select the results above a given threshold.
I found a couple of solutions for Lucene:
how do I normalise a solr/lucene score?
http://wiki.apache.org/lucene-java/ScoresAsPercentages
How would I go ahead and apply the same technique to ElasticSearch? Or perhaps there is already a solution that works with ES for score normalization?
As far as I have searched, there is no way to get a normalized score out of Elasticsearch. You will have to work around it by making two queries. The first is a pilot query (preferably with size 1, but with all other attributes the same) that fetches the max_score. Then you can run your actual query and use function_score to normalize the score: pass the max_score you got from the pilot query in params to function_score and use it to normalize every score. For reference, see this article snippet.
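The two-request approach described above could look roughly like this in Python. The helper names `pilot_body` and `normalized_body` are hypothetical, and the request bodies are a sketch that assumes a function_score query with a script_score function whose script uses the `source` key (older versions used a different key name). Sending the requests is left out.

```python
def pilot_body(query):
    """First request: same query, size 1, only used to read max_score."""
    return {"query": query, "size": 1}


def normalized_body(query, max_score, size=10):
    """Second request: divide each hit's score by the pilot's max_score."""
    if max_score is None or max_score <= 0:
        raise ValueError("max_score must be a positive number")
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": query,
                "script_score": {
                    "script": {
                        # Assumed script form; max_score is passed via params.
                        "source": "_score / params.max_score",
                        "params": {"max_score": max_score},
                    }
                },
                # Replace the original score with the normalized one.
                "boost_mode": "replace",
            }
        },
    }


user_query = {"match": {"title": "necklace"}}  # illustrative query
first = pilot_body(user_query)
# Suppose the pilot response reported max_score = 7.5 (made-up value).
second = normalized_body(user_query, max_score=7.5)
print(first)
print(second)
```

With this shape the top hit scores 1.0 and every other hit falls between 0 and 1, so a fixed threshold becomes comparable across queries. Scores still only express relevance relative to the best hit of the same query.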
It's a bit late.
We needed to normalise the ES score for one of our use cases, so we wrote a plugin that overrides the ES Rescorer feature.
It supports min-max and z-score normalization.
Github: https://github.com/bkatwal/elasticsearch-score-normalizer
Usage:
Min-max
{
  "query": {
    ... some query
  },
  "from" : 0,
  "size" : 50,
  "rescore" : {
    "score_normalizer" : {
      "normalizer_type" : "min_max",
      "min_score" : 1,
      "max_score" : 10
    }
  }
}
Usage z-score:
{
  "query": {
    ... some query
  },
  "from" : 0,
  "size" : 50,
  "rescore" : {
    "score_normalizer" : {
      "normalizer_type" : "z_score",
      "min_score" : 1,
      "factor" : 0.6,
      "factor_mode" : "increase_by_percent"
    }
  }
}
For complete documentation check the Github repository.
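For readers unfamiliar with the two normalization schemes named above, here is a generic Python sketch of min-max and z-score normalization applied to a list of scores. It shows the textbook formulas only. It is not the plugin's implementation, and the function names are assumptions.

```python
import statistics


def min_max(scores, new_min=0.0, new_max=1.0):
    """Linearly rescale scores so the lowest maps to new_min, the highest to new_max."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: map everything to the lower bound.
        return [float(new_min) for _ in scores]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (s - lo) * scale for s in scores]


def z_score(scores):
    """Express each score as a number of standard deviations from the mean."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population standard deviation
    if std == 0:
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


raw = [2.0, 4.0, 6.0]
print(min_max(raw, new_min=1, new_max=10))
print(z_score(raw))
```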

How to enable fuzziness for phrase queries in ElasticSearch

We're using ElasticSearch for searching through millions of tags. Our users should be able to include boolean operators (+, -, "xy", AND, OR, brackets). If no hits are returned, we fall back to a spelling suggestion provided by ES and search again. That's our query:
$ curl -XGET 'http://127.0.0.1:9200/my_index/my_type/_search' -d '
{
  "query" : {
    "query_string" : {
      "query" : "some test query +bools -included",
      "default_operator" : "AND"
    }
  },
  "suggest" : {
    "text" : "some test query +bools -included",
    "simple_phrase" : {
      "phrase" : {
        "field" : "my_tags_field",
        "size" : 1
      }
    }
  }
}'
Instead of only providing a fallback to spelling suggestions, we'd like to enable fuzzy matching. If, for example, a user searches for "stackoverfolw", ES should return matches for "stackoverflow".
Additional question: What's the better-performing method for "correcting" spelling errors? As it is now, we have to perform two subsequent requests, first with the original search term, then with the term suggested by ES.
The query_string query does support some fuzziness, but only when using the ~ operator, which I think doesn't fit your use case. I would add a fuzzy query and OR it with the existing query_string. For instance, you can use a bool query and add the fuzzy query as a should clause, keeping the original query_string as a must clause.
As for your additional question about how to correct spelling mistakes: I would use fuzzy queries to correct them automatically, and two subsequent requests if you want the user to select the right correction from a list (e.g. "Did you mean"), but your approach sounds good too.
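One way to combine the two queries in a single request is sketched below in Python. The helper name `fuzzy_fallback_body`, the `fuzziness` value "AUTO", and the choice of the field `my_tags_field` are assumptions. The sketch places both queries in should clauses (the "OR" reading of the answer), because with the query_string under must a misspelled term would still return no hits and the fuzzy clause would only affect ranking.

```python
def fuzzy_fallback_body(user_query, field, term, fuzziness="AUTO"):
    """Build one request that ORs the exact query_string with a fuzzy query.

    `user_query` keeps the boolean operators typed by the user, while
    `term` is the single word to match approximately on `field`.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "query_string": {
                            "query": user_query,
                            "default_operator": "AND",
                        }
                    },
                    {"fuzzy": {field: {"value": term, "fuzziness": fuzziness}}},
                ],
                # At least one of the two clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }


body = fuzzy_fallback_body("stackoverfolw", "my_tags_field", "stackoverfolw")
print(body)
```

Documents matching both clauses rank higher than documents matching only the fuzzy clause, so exact hits still come first while misspellings no longer need a second round trip.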
