Fuzzy Matching Fails But Exact Match Passes - elasticsearch

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For instance, this query returns a user record as expected:
{
"from": 0,
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"firstName": {
"query": "sVxGBCkPYZ",
"boost": 30
}
}
}
],
"should": [
]
}
},
"fields": [
"id",
"firstName"
]
}
However, replacing the match clause with the one below returns no records:
{
"fuzzy": {
"firstName": {
"value": "sVxGBCkPYZ",
"fuzziness": 2,
"boost": 30,
"min_similarity": 0.3
}
}
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference, this is the ES version I'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}

The fuzzy query fails because it is a term-level query, meaning the query string is not analysed. The indexed data, on the other hand, was analysed (I assume the field is of type text with the standard analyzer), so it is stored as svxgbckpyz in the inverted index.
Instead, you can implement fuzziness with a match query, as below:
POST testindex/_search
{
"query":{
"match":{
"firstName":{
"query":"sVxGBCkPYZ",
"fuzziness":"AUTO"
}
}
}
}
You can change the value from AUTO to 0, 1 or 2 depending on your use case; 2 is the largest edit distance Elasticsearch accepts.
The exact match you mentioned works because the match query analyses the query string and converts it to lower case, and the lower-cased term is what exists in the inverted index.
As for how the fuzzy query you mentioned works behind the scenes, as per this LINK, it is as follows:
The fuzzy query works by taking the original term and building a
Levenshtein automaton—like a big graph representing all the strings
that are within the specified edit distance of the original string.
The fuzzy query then uses the automaton to step efficiently through
all of the terms in the term dictionary to see if they match. Once it
has collected all of the matching terms that exist in the term
dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy
query with an edit distance of 2 can match a very large number of
terms and perform very badly.
Note this statement in particular: "representing all the strings that are within the specified edit distance of the original string".
For example, some of the strings at a distance of 1 from life are aife, bife, cife, dife, ..., lifz.
So in your case, the fuzzy query's automaton cannot reach the indexed term svxgbckpyz from the input string sVxGBCkPYZ, because the distance between them is 7 (remember that the distance between A and a is 1). AUTO allows at most 2 edits, and 2 is also the largest value you can set explicitly, so a term 7 edits away can never match.
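To check the distance of 7 claimed above, here is a minimal sketch of a plain Levenshtein distance in Python. It assumes that the standard analyzer simply lower-cases this particular value, and it ignores transpositions, which Elasticsearch can also count as a single edit; neither matters for this input.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


query_term = "sVxGBCkPYZ"          # sent unanalysed by the term-level fuzzy query
indexed_term = query_term.lower()  # assumed output of the standard analyzer

distance = levenshtein(query_term, indexed_term)
print(distance)  # 7, far above the maximum fuzziness of 2
```

Every upper-case letter costs one substitution, which is why lower-casing the input before sending it (or using a match query, which analyses it for you) brings the distance back to 0.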
Adding one more LINK for more info. Hope it helps!

Related

How can I improve/make stronger text fuzzy searching in Elasticsearch?

Below is my setup. I am inserting a user in ElasticSearch and I am doing weighted fuzziness username searches. The problem is that the fuzziness could be... fuzzier? I show you what I mean, this code is my mapping:
{
"mappings": {
"properties": {
"user_id": {
"enabled": false
},
"username": {
"type": "text"
},
"d_likes": {
"type": "rank_feature"
}
}
}
}
And I am inserting 2 users:
user_id: random, username: pietje, d_likes: 3
user_id: random, username: p13tje, d_likes: 30
Now the problem is that I need to write a lot of characters in the username field to get hits. This is how I search:
{
"query": {
"bool": {
"must": [
{
"match": {
"username": {
"query": "piet",
"fuzziness": "auto"
}
}
}
],
"should": [
{
"rank_feature": {
"field": "d_likes"
}
}
]
}
}
}
'piet' gives no results. That looks strange to me, I was hoping I would actually see both p13tje and pietje (in that order) because they are so similar. When my search query is pietj, I only get pietje and not p13tje.
So I was wondering: how can I get more hits with the fuzzy search? I want autocompletion for usernames, and this is a pretty bad user experience, because it only gives suggestions once you have typed in most of the characters. I just want the search to be looser and give more results.
ElasticSearch documentation:
When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string.
The Levenshtein Edit Distance essentially is a way of measuring the difference between 2 string values.
You've set the fuzziness parameter to AUTO, which is a great default decision. However, for some short strings like yours, it can prove to be not as fuzzy as you'd want it to be.
This is because ElasticSearch (ES) will generate an edit distance based on the length of the string, which will determine how many edits away the string in the index is from your search query.
You haven't specified any specific low or high values so for piet, as it's a 4 character string, only one edit will be allowed.
pietje is two edits away: piet needs a j as well as an e appended, so it won't show up.
p13tje is four edits away: besides the added j and e, the i has to become 1 and the e has to become 3, so it also won't show up.
The maximum allowed Levenshtein Edit Distance for ES fuzzy searching is 2 (larger differences are far more expensive to compute efficiently and are not processed by the Lucene search engine which ES is based on) so to fix this, set fuzziness to 2 manually.
"match": {
"username": {
"query": "piet",
"fuzziness": "2"
}
}
That will allow pietje, which is two edits away, to show up in the search. p13tje is four edits away, which is still beyond the maximum, so it will not match.
Instead of manually setting it to 2, you could also set the low and high length thresholds for AUTO (the format is AUTO:[low],[high], e.g. AUTO:15,30). However, raising the thresholds above the defaults of 3 and 6 only makes matching stricter for short terms.
For example, with a low of 8 and a high of 20:
Query terms shorter than 8 characters get no fuzzy matching and have to match exactly
Query terms of 8 to 19 characters are allowed only 1 edit
Query terms of 20 characters or more are allowed 2 edits
You can try tweaking the low and high values if you'd like, but for the... fuzziest fuzziness, set the edit distance to the maximum allowed Levenshtein edit distance (2).
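The AUTO rule can be sketched in a few lines of Python. This is an illustration of the rule as the Elasticsearch documentation states it (terms shorter than low must match exactly, terms from low up to high - 1 get one edit, longer terms get two), not Elasticsearch's own code; the function name auto_fuzziness is made up for this example.

```python
def auto_fuzziness(term_length: int, low: int = 3, high: int = 6) -> int:
    """Edits allowed for a query term under AUTO:[low],[high] (defaults 3 and 6)."""
    if term_length < low:
        return 0  # too short: must match exactly
    if term_length < high:
        return 1  # medium length: one edit allowed
    return 2      # long enough: two edits, the maximum


# With the default thresholds, the 4-character query "piet" gets one edit.
print(auto_fuzziness(len("piet")))

# With AUTO:8,20 the boundaries move to lengths 8 and 20.
for length in (7, 8, 19, 20):
    print(length, auto_fuzziness(length, low=8, high=20))
```

Note that the length that counts is that of the query term, not of the indexed username, which is why a short prefix such as piet gets so little tolerance.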

Is query context evaluated before filter context in elasticsearch? How to determine the order of evaluation?

I am using the below query :
GET customer/doc/_search?routing=123
{
"query": {
"bool": {
"filter": [
{
"term": {
"location": "Delhi"
}
}
],
"should": [
{
"match_phrase_prefix": {
"phone": {
"query": "650",
"max_expansions": 100
}
}
}
]
}
}
}
The problem is my search on phone isn't working anymore. It used to work fine when I had less data, now every shard has data for multiple locations. Search on phone now requires me to type in 6 or 7 characters at times. (There may be matching phone numbers that have different location but are on this shard)
This is due to max_expansions I am guessing. When I increase it to 500 it does return me search results (not all), but the query becomes slow.
Isn't there a way to force es to apply filter first (and restrict the dataset) and then apply the should clause, so that I get the matching results even with small value of max_expansions?
Any help is appreciated.
It is due to max_expansions. Restricting the dataset first is probably not what you want to do (that's also not very straightforward: you may have to use a script, which will in turn slow down the query).
When you query with a prefix or wildcard expression, Lucene expands it into the set of actual terms in your inverted index's term dictionary. So when you cap the term expansion at 500, it might miss a few matching terms.
I would consider indexing prefixes during the indexing phase. Indexed prefixes avoid the costly expansion at query time.
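The answer does not name a technique, so the following is only one possible reading: an edge n-gram token filter that stores each phone number's prefixes as separate terms. The Python function below simulates what such a filter would emit; the phone number, the filter name phone_prefixes and the analyzer name phone_prefix_analyzer are all hypothetical.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 10) -> List[str]:
    """Return the leading substrings of `token` from min_gram to max_gram characters."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Hypothetical index settings using the edge_ngram token filter.
# The filter and analyzer names are invented for this sketch.
index_settings = {
    "analysis": {
        "filter": {
            "phone_prefixes": {"type": "edge_ngram", "min_gram": 3, "max_gram": 10}
        },
        "analyzer": {
            "phone_prefix_analyzer": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["phone_prefixes"],
            }
        },
    }
}

# Terms that would be stored for one made-up phone number.
prefixes = edge_ngrams("6501234567", min_gram=3, max_gram=10)
print(prefixes)
```

With the prefixes stored as terms, a search for 650 becomes a plain term lookup combined with the location filter, so max_expansions no longer comes into play. The search input itself should not be n-grammed, which usually means configuring a separate search analyzer for the field.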

What is the difference between must and filter in Query DSL in elasticsearch?

I am new to elastic search and I am confused between must and filter. I want to perform an and operation between my terms, so I did this
POST /xyz/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"city": "city1"
}
},
{
"term": {
"saleType": "sale_type1"
}
}
]
}
}
}
which gave me the required results matching both the terms, and on using filter like this
POST /xyz/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"city": "city1"
}
}
],
"filter": {
"term": {
"saleType": "sale_type1"
}
}
}
}
}
I get the same result, so when should I use must and when should I use filter? What is the difference?
must contributes to the score. In filter, the score of the query is ignored.
In both must and filter, the clause(query) must appear in matching documents. This is the reason for getting same results.
You may check this link
Score
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document.
To learn how the score is calculated, refer to this link
must returns a score for every matching document. This score helps you rank the matching documents, and compare the relative relevance between documents (using the magnitude of the score of each document).
With this, one can say how much more relevant Doc 1 is than Doc 2, or that Docs 1 to 7 are much more relevant than Docs 8 and beyond.
For how the relative score is determined, you can refer to the references below.
Briefly, it is related to the number of term occurrences in the document, the document length, and the average number of term occurrences in your database index.
filter doesn't return a score. All one can say is, all matching documents are of relevance. But it won't help in evaluating if one is more relevant than the other. You can think of filter as a must with only 2 scores: zero or non-zero, and where all zero-scored documents are dropped.
filter is helpful if you just want to whitelist/blacklist for e.g., all documents belonging to the topic "pets".
In summary, there are 3 points that will help you in deciding when to use what:
must is your only choice when comparing/ranking documents by relevance
filter excludes all documents that don't match
filter is a lot faster because Elasticsearch doesn't need to compute the relevance score, and frequently used filters can be cached
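As a rough sketch of how this choice looks when building requests in code, the helper below puts scoring clauses under must and pure yes/no conditions under filter. The helper name bool_query is made up for illustration; the clauses are the ones from the question.

```python
import json


def bool_query(scoring=(), non_scoring=()):
    """Build a bool query body: `scoring` clauses go to must, `non_scoring` to filter."""
    body = {}
    if scoring:
        body["must"] = list(scoring)        # these clauses contribute to _score
    if non_scoring:
        body["filter"] = list(non_scoring)  # these clauses only include or exclude
    return {"query": {"bool": body}}


city = {"term": {"city": "city1"}}
sale_type = {"term": {"saleType": "sale_type1"}}

ranked = bool_query(scoring=[city, sale_type])             # first query in the question
mixed = bool_query(scoring=[city], non_scoring=[sale_type])  # second query in the question
filtered_only = bool_query(non_scoring=[city, sale_type])  # no ranking needed at all

print(json.dumps(filtered_only, indent=2))
```

All three bodies select the same documents; they differ only in which clauses influence the ordering. For exact term conditions like these, where ranking rarely matters, the filter-only form is usually the natural fit.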
References:
Query vs Filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
Computation of Relevance: https://www.infoq.com/articles/similarity-scoring-elasticsearch/

Elasticsearch, sorting by exact string match

I want to sort results, such that if one specific field (let's say 'first_name') is equal to an exact value (let's say 'Bob'), then those documents are returned first.
That would result in all documents where first_name is exactly 'Bob', would be returned first, and then all the other documents afterwards. Note that I don't intend to exclude documents where first_name is not 'Bob', merely sort them such that they're returned after all the Bobs.
I understand how numeric or alphabetical sorting works in Elasticsearch, but I can't find any part of the documentation covering this type of sorting.
Is this possible, and if so, how?
One solution is to manipulate the score of the results that contain the Bob in the first name field.
For example:
POST /test/users
{
"name": "Bob"
}
POST /test/users
{
"name": "Alice"
}
GET /test/users/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "Bob",
"boost" : 2
}
}
},
{
"match_all": {}
}
]
}
}
}
Would return both Bob and Alice in that order (with approximate scores of 1 and 0.2 respectively).
From the book:
Query-time boosting is the main tool that you can use to tune
relevance. Any type of query accepts a boost parameter. Setting a
boost of 2 doesn’t simply double the final _score; the actual boost
value that is applied goes through normalization and some internal
optimization. However, it does imply that a clause with a boost of 2
is twice as important as a clause with a boost of 1.
This means that if you also wanted "Fred" to come ahead of Bob, you could add a match clause for Fred with a boost of 3 to the example above.
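A small sketch of that idea in Python, assuming you build the request body in code: each preferred value gets its own boosted match clause, and match_all keeps every other document in the result. The function name pinned_first_query is hypothetical.

```python
import json


def pinned_first_query(field, boosts):
    """Build a bool/should query that ranks the given values by their boost.

    `boosts` maps a field value to its boost; a higher boost ranks earlier.
    """
    should = [
        {"match": {field: {"query": value, "boost": boost}}}
        for value, boost in boosts.items()
    ]
    should.append({"match_all": {}})  # keeps non-matching documents in the results
    return {"query": {"bool": {"should": should}}}


body = pinned_first_query("name", {"Fred": 3, "Bob": 2})
print(json.dumps(body, indent=2))
```

Because the final score still passes through normalization, the boosts are best read as a relative ordering of importance rather than exact score multipliers.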

Constant Score Query elasticsearch boosting

My understanding of Constant Score Query in elasticsearch is that boost factor would be assigned as score for every matching query. The documentation says:
A query that wraps a filter or another query and simply returns a constant score equal to the query boost for every document in the filter.
However when I send this query:
"query": {
"constant_score": {
"filter": {
"term": {
"source": "BBC"
}
},
"boost": 3
}
},
"fields": ["title", "source"]
all the matching documents are given a score of 1?! I cannot figure out what I am doing wrong, and had also tried with query instead of filter in constant_score.
Scores are only meant to be relative to all other scores in a given result set, so a result set where everything has the score of 3 is the same as a result set where everything has the score of 1.
Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries. - Elasticsearch Guide
Either the constant score is being ignored because it's not being combined with another query, or it's being normalized. As @keety said, check the output of explain to see exactly what's going on.
A constant score query gives an equal score to every matching document, irrespective of scoring factors like TF, IDF, etc. Use it when you don't care how well a document matched, only whether it matched, but still want a score assigned, unlike with a filter.
If you literally want a score of 3 for all the documents matching a particular query, then you should use a function score query, something like
"query": {
"function_score": {
"functions": [
{
"filter": { "term": { "source": "BBC" } },
"weight": 3
}
]
}
...
}
