Elastic Search fuzzy search not finding desired results

In my Elasticsearch index I have trademark data. The index contains trademarks named HILTON, but when I search for hillytown it does not find hilton, while a search for hilytown does. How can I modify the Elasticsearch query to find hilton when I search for hillytown?
Note: hillytown has two l's.
The search that I tried is
$param = '
{
  "query": {
    "fuzzy": {
      "trademark": {
        "value": "'.$keyword.'",
        "boost": 1.0,
        "fuzziness": "AUTO",
        "prefix_length": 0,
        "max_expansions": 100,
        "transpositions": true
      }
    }
  }
}';
posted to
http://localhost:9200/watch_index_write/_search

I don't think you can get "hillytown" to match "hilton". Elasticsearch allows a maximum fuzziness (Levenshtein distance) of 2, but hillytown is at a distance of 3 from hilton (three letters need to be removed).
I haven't tried it myself, but the phonetic analysis plugin might provide a way forward.
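For illustration, a minimal sketch of a phonetic mapping, assuming the analysis-phonetic plugin is installed (the index name trademarks, the sub-field name phonetic, and the double_metaphone encoder are placeholders, not taken from the question):
PUT trademarks
{
  "settings": {
    "analysis": {
      "filter": {
        "trademark_phonetic": {
          "type": "phonetic",
          "encoder": "double_metaphone",
          "replace": false
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "trademark_phonetic"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "trademark": {
        "type": "text",
        "fields": {
          "phonetic": {
            "type": "text",
            "analyzer": "phonetic_analyzer"
          }
        }
      }
    }
  }
}
A match query against trademark.phonetic then compares encoded sounds rather than raw characters, so hillytown and hilton may end up with similar codes; whether they actually do depends on the encoder, so this needs to be verified against real data.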

Related

How to give more weight-age to specific keywords while searching for similar text using elasticsearch?

I am using Elasticsearch to get relevant blog articles from a database of articles. I want results that contain particular words to be given a higher score than results that do not contain them.
I have tried adding stop words and giving more weight to other fields, but the results are not quite as expected. I am using the developer console of the Kibana interface of Elasticsearch.
"""
GET blog-desc/_search
{
"query": {
"more_like_this" : {
"fields" : ["Meta description","Title^5",
"Short title^0.5"],
"like" : "Harry had a silver wand he likes to play with! Among his friends he has the most expensive one. The only difference between his wand and his sister's is that in the color",
"min_term_freq" : 1,
"max_query_terms" : 12,
"minimum_should_match": "30%",
"stop_words": ["difference", "play", "among"]
, "boost_terms": 1
}
}
}
"""
In the sample code above, I would want search results containing the word "silver" to be given a higher score than articles that do not contain that word.
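One way to express this kind of preference (just a sketch, not from the original post; the field names follow the question and the boost value is arbitrary) is to wrap the more_like_this query in a bool query and add a should clause that rewards documents containing the desired term:
GET blog-desc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "more_like_this": {
            "fields": ["Meta description", "Title^5", "Short title^0.5"],
            "like": "Harry had a silver wand he likes to play with!",
            "min_term_freq": 1,
            "max_query_terms": 12,
            "minimum_should_match": "30%"
          }
        }
      ],
      "should": [
        {
          "match": {
            "Meta description": {
              "query": "silver",
              "boost": 5
            }
          }
        }
      ]
    }
  }
}
Documents that match the should clause get its score added on top of the more_like_this score, so articles mentioning silver rank higher without excluding the rest.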

Elasticsearch compare long sequence strings with fuzzy query

I have two long String sequences that are similar:
C50FD711C2C43287351892A4D82F44B055F048C46D2C54197AC1D1E921F11E6699C4057C4B93907518E6DCA51A672D3D3E419160DAE276CB7716D11B94D8C3BB2E4A591329B7AF973D17A7F9336342FFAAFD4D
and
C50FD711C2C43287351892A4D820B5EAC5F048C1E67CAC197AC1D1E921F11C3623C1DCD6493907518E6DCA18CD71016E7FD1160DAE276CB7716D11B94A6B762E4A591329B7AF973D17A7F9336342FFAAFD4D
Their edit distance is 41.
I would like to find strings that are similar to each other. I started with a query like this:
GET my_index/_type/_search
{
  "query": {
    "fuzzy": {
      "sequence.keyword": {
        "value": "C50FD711C2C43287351892A4D820B5EAC5F048C1E67CAC197AC1D1E921F11C3623C1DCD6493907518E6DCA18CD71016E7FD1160DAE276CB7716D11B94A6B762E4A591329B7AF973D17A7F9336342FFAAFD4D",
        "boost": 1.0,
        "fuzziness": 50,
        "prefix_length": 10,
        "max_expansions": 200
      }
    }
  }
}
I tried with both sequence.keyword and sequence; the field is of type text and of type keyword.
However, it did not find the other similar sequence string in my index. Why?
The answer is pretty simple. The maximum edit distance that is allowed is 2 (as can be seen in the source code for the Fuzziness class).
You can verify this with a simpler example: if you index AAAAAA and search for AAABBB with fuzziness: 3, you'll get nothing.
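A minimal reproduction of that limit could look like this (the index name fuzz_test is made up; depending on the Elasticsearch version, a fuzziness above 2 is either not honoured or rejected with an error, but in no case does it bridge three edits):
PUT fuzz_test/_doc/1
{
  "sequence": "AAAAAA"
}

GET fuzz_test/_search
{
  "query": {
    "fuzzy": {
      "sequence.keyword": {
        "value": "AAABBB",
        "fuzziness": 3
      }
    }
  }
}
AAABBB is three substitutions away from AAAAAA, so it falls outside the supported range and the document is not returned.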

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For instance, this query returns a user record as expected:
{
  "from": 0,
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "firstName": {
              "query": "sVxGBCkPYZ",
              "boost": 30
            }
          }
        }
      ],
      "should": []
    }
  },
  "fields": [
    "id",
    "firstName"
  ]
}
However replacing the match element with the below fails to return any records:
{
  "fuzzy": {
    "firstName": {
      "value": "sVxGBCkPYZ",
      "fuzziness": 2,
      "boost": 30,
      "min_similarity": 0.3
    }
  }
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference, this is the ES version I'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}
The match fails because fuzzy searches are term-level queries, meaning the query string is not analysed, while the indexed data (assuming the field is of type text with the standard analyzer) would have been converted to svxgbckpyz in the inverted index.
You can instead implement fuzziness with a match query, as below:
POST testindex/_search
{
  "query": {
    "match": {
      "firstname": {
        "query": "sVxGBCkPYZ",
        "fuzziness": "AUTO"
      }
    }
  }
}
You can change the value from AUTO to 1 or 2 depending on your use case.
The exact match you mentioned also works because the query string gets analysed and converted to lower case, which is what is stored in the inverted index.
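You can see the effect of the analysis step directly with the _analyze API (this assumes the field uses the default standard analyzer, as the answer does):
GET _analyze
{
  "analyzer": "standard",
  "text": "sVxGBCkPYZ"
}
This returns the single token svxgbckpyz, which is the term the match query is compared against, while the fuzzy query keeps the original mixed-case string.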
As for how the fuzzy query you've mentioned works behind the scenes, as per this LINK, it is as follows:
The fuzzy query works by taking the original term and building a
Levenshtein automaton—like a big graph representing all the strings
that are within the specified edit distance of the original string.
The fuzzy query then uses the automaton to step efficiently through
all of the terms in the term dictionary to see if they match. Once it
has collected all of the matching terms that exist in the term
dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy
query with an edit distance of 2 can match a very large number of
terms and perform very badly.
Note this statement in particular: "representing all the strings that are within the specified edit distance of the original string".
For example, some of the words within a distance of 1 of life would be aife, bife, cife, dife, ..., lifz.
So in your case, the fuzzy query's automaton would not be able to produce the term svxgbckpyz from the input string sVxGBCkPYZ, firstly because the distance between them is 7 (remember, the distance between A and a is 1), which the AUTO option cannot cover, and even if you could configure a fuzziness of 7 it would be unlikely to help, as there would be a huge list of terms within distance 7.
Adding one more LINK for more info. Hope it helps!

How to normalize ElasticSearch scores?

For my project I need to find out which results of a search should be considered "good" matches. Currently, the scores vary wildly depending on the query, hence the need to normalize them somehow. Normalizing the scores would make it possible to select the results above a given threshold.
I found a couple of solutions for Lucene:
how do I normalise a solr/lucene score?
http://wiki.apache.org/lucene-java/ScoresAsPercentages
How would I go ahead and apply the same technique to ElasticSearch? Or perhaps there is already a solution that works with ES for score normalization?
As far as I have searched, there is no way to get a normalized score out of Elasticsearch. You will have to hack it by making two queries. The first is a pilot query (preferably with size 1, but all other attributes the same) that fetches the max_score. Then you run your actual query and use function_score to normalize the score: pass the max_score from the pilot query in params to the function_score script and use it to normalize every score. Refer: this article snippet
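A sketch of what the second query in that two-pass approach could look like (the index name, the inner query, and the max_score value of 12.3 are placeholders; the real value comes from the pilot query):
GET my_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "some search terms" }
      },
      "script_score": {
        "script": {
          "source": "_score / params.max_score",
          "params": { "max_score": 12.3 }
        }
      },
      "boost_mode": "replace"
    }
  }
}
With boost_mode set to replace, each hit's score becomes the original score divided by the pilot query's maximum, so the results fall roughly into the 0-1 range and a fixed threshold can be applied.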
It's a bit late.
We needed to normalise the ES score for one of our use cases. So, we wrote a plugin that overrides the ES Rescorer feature.
It supports min-max and z-score normalization.
Github: https://github.com/bkatwal/elasticsearch-score-normalizer
Usage:
Min-max
{
  "query": {
    ... some query
  },
  "from": 0,
  "size": 50,
  "rescore": {
    "score_normalizer": {
      "normalizer_type": "min_max",
      "min_score": 1,
      "max_score": 10
    }
  }
}
Usage z-score:
"query": {
... some query
},
"from" : 0,
"size" : 50,
"rescore" : {
"score_normalizer" : {
"normalizer_type" : "z_score",
"min_score" : 1,
"factor" : 0.6,
"factor_mode" : "increase_by_percent"
}
}
}
For complete documentation check the Github repository.

Boosting only results with a near-identical score in Elasticsearch

I'm using the following query to search through a database of names, allowing fuzzy matching but giving preference to exact matches.
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "x",
"operator": "and",
"boost": 10
}
}
},
{
"match": {
"name": {
"query": "x",
"fuzziness": "AUTO",
"operator": "and"
}
}
},
{
"match": {
"altname": {
"query": "x",
"fuzziness": "AUTO",
"operator": "and"
}
}
}
]
}
}
The database contains entries with identical names. If that happens, I would like to boost those entries by a second field, let's call it weight. However, I only want the boost to be applied between the subset of results with a (near) identical score, not to all of the results.
This is further complicated by the fact that results with an identical name may receive a slightly different score, as they are influenced by the relevancy on the altname field.
For example, querying for dog could give 3 results:
Dog [id 1, score 2.3, weight 10]
Dog [id 2, score 2.2, weight 20]
Doge [id 3, score 1, weight 100]
I'm looking for a query that would boost the result with id 2 to the top score. The result with id 3 should always stay at the bottom due to its poor relevancy, regardless of its weight. Ideally with tunable parameters to tweak the factor of the score vs. the factor of the weight.
Any way to do this in a single pass in Elasticsearch, of course without ruining performance?
Looks like I figured it out.
First, I realised that the example in my original question was more complex than necessary. I narrowed it down to: "How to compose a query for 'blub' that returns the following documents in the order 2, 3, 1"
id: 1
name: blub
weight: 0.01
---
id: 2
name: blub
weight: 0.1
---
id: 3
name: blub stuff
weight: 1
Thus: for the two documents with an identical (or very similar) score, the weight should be used as a tie-breaker. But documents with a significantly lower score should never be allowed to trump other results, regardless of their weight.
I loaded the data in the excellent Play tool: https://www.found.no/play/gist/edd93c69c015d4c62366#search and started experimenting.
It turned out the log2p modifier did exactly what I wanted. I repeated it on a real-world dataset and everything looks as expected.
function_score:
  query:
    match:
      name: blub
  field_value_factor:
    field: weight
    modifier: log2p
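For reference, the same query written out as a full JSON request body (the index name is a placeholder) would look roughly like this:
GET my_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "name": "blub" }
      },
      "field_value_factor": {
        "field": "weight",
        "modifier": "log2p"
      }
    }
  }
}
field_value_factor multiplies the relevance score by the logarithm of (weight + 2), a gentle factor that breaks ties between equally relevant documents without letting a large weight overturn a clearly lower relevance score.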
