Matches on different words should score higher than multiple matches on one word in elasticsearch - elasticsearch

In our elasticsearch we have indexed some persons where each person can have multiple taggings.
Take for example 2 persons (fullname - (taggings)):
Bart Newman - (bart,engineer,ceo)
Bart Holland - (developer,employer)
Our searchquery
{
"multi_match": {
"type": "most_fields",
"query": "bart developer",
"operator": "or",
"boost": 5,
"fields": [
"fullname^5",
"taggings.tag.name^5"
],
"fuzziness": 0
}
}
Let's say we are searching for "bart developer". We would expect Bart Holland to come before Bart Newman, but because Bart Newman has bart in his fullname and bart as a tag, he scores higher than Bart Holland does.
Is there a way to configure the query so that matches on different words (bart, developer) score higher than multiple matches on one word (bart)?
I already tried the and-operator without success.
Thanks!

This is expected with a most_fields query: it is field-centric rather than term-centric. From the docs:
most_fields being field-centric rather than term-centric: it looks for
the most matching fields, when really what we’re interested is the
most matching terms.
Another problem, which is also likely in your case, is Inverse Document Frequency (IDF). I guess only a few documents have a tag named bart, so its IDF is very high and a match on it gets a higher score.
As described in the docs linked above, you should check how documents are scored with validate and explain.
There are a couple of ways to solve this issue:
1) You can use a custom _all field, i.e. copy both the full name and the tag information to a new field with the copy_to parameter and then query that field. You have to reindex your data for that.
2) I think a better solution would be to use cross_fields, which takes a term-centric approach. From the docs:
The cross_fields type first analyzes the query string to produce a
list of terms, and then it searches for each term in any field.
It also solves the IDF issue by blending term statistics across all fields.
This should solve your issue.
{
"query": {
"multi_match": {
"type": "cross_fields",
"query": "bart developer",
"operator": "or",
"fields": [
"fullname",
"taggings.tag.name"
]
}
}
}
Hope this helps!
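To make the field-centric versus term-centric difference concrete, here is a minimal toy sketch in Python. It is not Elasticsearch's real scoring (no BM25, no IDF, no boosts); the two counting functions and the `PEOPLE` data structure are assumptions made up for illustration, using the two persons from the question.

```python
# Toy model only: this does NOT reproduce Elasticsearch's scoring.
# Each document is a dict of field name -> set of already-lowercased tokens.
PEOPLE = {
    "Bart Newman": {
        "fullname": {"bart", "newman"},
        "taggings": {"bart", "engineer", "ceo"},
    },
    "Bart Holland": {
        "fullname": {"bart", "holland"},
        "taggings": {"developer", "employer"},
    },
}


def field_centric(doc, terms):
    """Count (field, term) hits: a term found in two fields counts twice."""
    return sum(len(tokens & terms) for tokens in doc.values())


def term_centric(doc, terms):
    """Count distinct query terms found in any field."""
    all_tokens = set().union(*doc.values())
    return len(all_tokens & terms)


query = {"bart", "developer"}
for name, doc in PEOPLE.items():
    print(name, field_centric(doc, query), term_centric(doc, query))
```

In this toy model the field-centric count cannot tell the two persons apart (in the real index, the high IDF of the rare bart tag then tips the balance towards Bart Newman), while the term-centric count puts Bart Holland first because he matches both query terms.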

Related

Which analyzer is used while using fuzzy operator with query_string clause?

Suppose I have a query clause like,
{
"query":
{
"query_string": {
"query": "ads spark~",
"fields": [
"flowName",
"projectName"
],
"default_operator": "and"
}
}
}
For this the explain output is:
"explanation": "+(projectName:ads | flowName:ads) +(projectName:spark~1 | flowName:spark~1)"
Whereas if I remove the fuzzy operator from the query (updated query clause below),
{
"query":
{
"query_string": {
"query": "ads spark",
"fields": [
"flowName",
"projectName"
],
"default_operator": "and"
}
}
}
I get a different explain output,
"explanation": "(projectName:ads spark | flowName:ads spark)"
Any idea why the tokens are generated differently in the two cases?
When you use fuzzy queries the way the query is parsed and constructed in Lucene differs from the normal behavior.
The one you see with the explanation is the Lucene query built from the query text.
When using fuzziness most of the text analysis is not done, only the filters that work on a per-character basis are allowed, as you can read in the documentation [1][2].
In the first case, since you are using fuzziness, the query text is split on whitespace. Then, for each term a mandatory clause is built (the AND operator states that each term MUST appear in the document). You can call this a "term centric" query. Each term is then searched across the multiple input fields with a disjunction (|) clause.
You therefore see "ads MUST be in projectName OR flowName, AND spark (with variations within the Levenshtein distance) MUST be in projectName OR flowName".
In the second case, no fuzziness is used. Here the query is passed to each field and then the terms will follow the corresponding field text analysis (if any). You may call this a "field centric" query. Therefore you see "ads spark MUST be in projectName OR flowName" to have a document match.
You are effectively moving from an "I want all the terms to appear in the document" (it could be in different fields) to "I want all terms to appear in a single field".
If you want an in-depth analysis you can read this blog post: https://sease.io/2021/05/apache-solr-sow-parameter-split-on-whitespace-and-multi-field-full-text-search.html. It is about Solr, but Elasticsearch applies the same behavior.
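As a rough illustration of the two query shapes described above, here is a small Python sketch that only rebuilds the explanation strings from the question. It is not the real Lucene query parser; both helper functions are hypothetical names introduced for this example.

```python
def term_centric_query(terms, fields):
    """One mandatory (+) clause per term; each term may match in any field."""
    clauses = []
    for term in terms:
        per_field = " | ".join(f"{field}:{term}" for field in fields)
        clauses.append(f"+({per_field})")
    return " ".join(clauses)


def field_centric_query(text, fields):
    """The whole query text is handed to each field; any one field may match."""
    per_field = " | ".join(f"{field}:{text}" for field in fields)
    return f"({per_field})"


fields = ["projectName", "flowName"]

# Fuzzy case: text split on whitespace, one clause per term.
print(term_centric_query(["ads", "spark~1"], fields))

# Non-fuzzy case: the full text goes to each field.
print(field_centric_query("ads spark", fields))
```

The first call reproduces the explain output of the fuzzy query, the second one the explain output of the query without the fuzzy operator.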

Elasticsearch query multi_match

I'm trying to create an elasticsearch query that looks for multiple fields. This works fine so far. However, I would like to refine this.
Let's say the word "test" was indexed. When I search for "tes", it does not find that word, but I would like it to show up already. Combining that with my query is the challenge.
{
"multi_match" : {
"query": "*" + query + "*",
"type": "cross_fields",
"operator": "and",
"fields": ["article.number^1","article.name_de^1", "article.name_en^5", "article.name_fr^5", "article.description^1"],
"tie_breaker": 0
}
}
Depending on your constraints, here are your options.
If you wish to use a wildcard before/after your search term, you can use a wildcard query. This has a high processing cost at query time.
If you are fine with the additional storage cost, you can instead tokenize your input into n-grams during analysis. See the ngram tokenizer. Beware that if you have long strings, this can quickly explode the storage requirement.
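To see why n-grams make "tes" findable, here is a minimal Python sketch of what an n-gram tokenizer emits for a single word. It only imitates the idea; the function name and the min_gram/max_gram values of 2 and 3 are assumptions chosen for illustration, not your index settings.

```python
def ngrams(text, min_gram, max_gram):
    """Emit every substring of length min_gram..max_gram, position by position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


indexed = ngrams("test", min_gram=2, max_gram=3)
print(indexed)           # ['te', 'tes', 'es', 'est', 'st']
print("tes" in indexed)  # True
```

Because "tes" is stored as its own token, a plain match on it succeeds without any wildcard. The list also shows where the storage cost comes from: a four-letter word already produces five tokens.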

Fuzzy Matching Fails But Exact Match Passes

I've been constructing an ElasticSearch query using Fuzzy Matching to match a user in the system. When running it against a specific group of users (ones with my name), the query appears to work perfectly, but when running it against a random selection of users, it appears to fail.
For the purposes of my testing, I'm passing in the exact values of a specific user, so I would expect at least 1 match.
In narrowing this down, I found that an exact match against a name returns the data as expected, but putting the same value into a fuzzy block causes it to return 0 results.
For Instance, this query returns a user record as expected:
{
"from": 0,
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"firstName": {
"query": "sVxGBCkPYZ",
"boost": 30
}
}
}
],
"should": [
]
}
},
"fields": [
"id",
"firstName"
]
}
However replacing the match element with the below fails to return any records:
{
"fuzzy": {
"firstName": {
"value": "sVxGBCkPYZ",
"fuzziness": 2,
"boost": 30,
"min_similarity": 0.3
}
}
}
Why would this be happening, and is there anything I can do to remedy the situation?
For reference, this is the ES version I'm currently using:
"version": {
"number": "1.7.1",
"build_hash": "b88f43fc40b0bcd7f173a1f9ee2e97816de80b19",
"build_timestamp": "2015-07-29T09:54:16Z",
"build_snapshot": false,
"lucene_version": "4.10.4"
}
The fuzzy query fails because fuzzy searches are term-level queries, meaning the query string is not analysed, while the indexed data, which I assume is of type text with the standard analyzer, is stored as svxgbckpyz in the inverted index.
You can instead implement fuzziness with a match query, as below:
POST testindex/_search
{
"query":{
"match":{
"firstName":{
"query":"sVxGBCkPYZ",
"fuzziness":"AUTO"
}
}
}
}
You can change the value from AUTO to 1 or 2 depending on your use case (2 is the maximum edit distance Elasticsearch accepts).
The exact match you mentioned works because the query string gets analysed, which converts the input to lower case, the form that is available in the inverted index.
As for how the fuzzy query you mentioned works behind the scenes, as per this LINK, it is as follows:
The fuzzy query works by taking the original term and building a
Levenshtein automaton—like a big graph representing all the strings
that are within the specified edit distance of the original string.
The fuzzy query then uses the automaton to step efficiently through
all of the terms in the term dictionary to see if they match. Once it
has collected all of the matching terms that exist in the term
dictionary, it can compute the list of matching documents.
Of course, depending on the type of data stored in the index, a fuzzy
query with an edit distance of 2 can match a very large number of
terms and perform very badly.
Note this statement in particular: representing all the strings that are within the specified edit distance of the original string.
For example, some of the words within a distance of 1 from life would be aife, bife, cife, dife, ..., lifz.
So in your case, the fuzzy search's automaton cannot reach the term svxgbckpyz from the input string sVxGBCkPYZ, because the distance between them is 7 (remember that the distance between A and a is 1). The AUTO option allows at most 2 edits, and Elasticsearch does not accept a fuzziness above 2 anyway, since the list of words within a distance of 7 would be huge.
Adding one more LINK for more info. Hope it helps!
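If you want to check the distance of 7 yourself, here is a minimal Python sketch of the classic Levenshtein distance. It is a plain textbook implementation written for this answer, not the automaton Lucene uses internally, and it ignores transpositions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("life", "lifz"))                      # 1
print(levenshtein("sVxGBCkPYZ", "svxgbckpyz"))          # 7
print(levenshtein("sVxGBCkPYZ".lower(), "svxgbckpyz"))  # 0
```

The last line shows why the match query works: once the input is lowercased by the analyzer, the distance to the indexed term is 0.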

elasticsearch: or operator, number of matches

Is it possible to score my searches according to the number of matches when using operator "or"?
Currently query looks like this:
"query": {
"function_score": {
"query": {
"match": {
"tags.eng": {
"query": "apples banana juice",
"operator": "or",
"fuzziness": "AUTO"
}
}
},
"script_score": {
"script": # TODO
},
"boost_mode": "replace"
}
}
I don't want to use "and" operator, since I want documents containing "apple juice" to be found, as well as documents containing only "juice", etc. However a document containing the three words should score more than documents containing two words or a single word, and so on.
I found a possible solution here https://github.com/elastic/elasticsearch/issues/13806
which uses bool queries. However I don't know how to access the tokens (in this example: apples, banana, juice) generated by the analyzer.
Any help?
Based on the discussions above I came up with the following solution, which is a bit different from what I imagined when I asked the question, but works for my case.
First of all I defined a new similarity:
"settings": {
"similarity": {
"boost_similarity": {
"type": "scripted",
"script": {
"source": "return 1;"
}
}
}
...
}
Then I had the following problem:
a query for "apple banana juice" gave the same score to a doc with tags ["apple juice", "apple"] and to another doc with tags ["banana", "apple juice"], although I would like the second one to score higher.
From this other discussion I found out that the issue was caused by my tags being a nested field, so I created a regular text field to address it.
But I also wanted to distinguish between a doc with tags ["apple", "banana", "juice"] and another doc with the tag ["apple banana juice"] (all three words in the same tag). The final solution was therefore to keep both fields (a nested and a text field) for my tags.
Finally, the query consists of a bool query with two should clauses: the first should clause runs on the text field and uses an "or" operator. The second should clause runs on the nested field and uses an "and" operator.
Although I found a solution for this specific issue, I still face a few other problems when using ES to search for tagged documents. The examples in the documentation seem to work very well when searching full texts. But does someone know where I can find something more specific to tagged documents?
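For readers who want to see the shape of such a query, here is a hedged Python sketch that builds the request body as a dict. The field names `tags_text` (the plain text field), `tags` (the nested path) and `tags.eng` (the field inside the nested objects) are assumptions for illustration, since the actual mapping is not shown above.

```python
import json


def build_tag_query(text, text_field="tags_text",
                    nested_path="tags", nested_field="tags.eng"):
    """Bool query with two should clauses, as described in the answer.

    All field names are hypothetical defaults, not the author's mapping.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Clause 1: plain text field, any of the words may match.
                    {"match": {text_field: {"query": text, "operator": "or"}}},
                    # Clause 2: nested field, all words must be in the same tag.
                    {
                        "nested": {
                            "path": nested_path,
                            "query": {
                                "match": {
                                    nested_field: {"query": text, "operator": "and"}
                                }
                            },
                        }
                    },
                ]
            }
        }
    }


print(json.dumps(build_tag_query("apple banana juice"), indent=2))
```

A document matching only the first clause is still returned, and one that also has a single tag containing all three words gets the extra score from the second clause.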

Boosting in Elasticsearch

I am new to elasticsearch. In elasticsearch we can use the term boost in almost all queries. I understand it's used to modify the score of documents, but I can't find an actual use for it. My question is: if I use boost values in some queries, will it affect the final score of the search, or the boost rank of docs in the index itself?
And what is the main difference between boost at index time and boost at query time?
Thanks in advance!
Query time boost allows you to give more weight to one query than to another. For instance, let's say you are querying the title and body fields for "Quick Brown Fox", you could write it as:
{
"query": {
"bool": {
"should": [
{
"match": {
"title": "Quick Brown Fox"
}
},
{
"match": {
"body": "Quick Brown Fox"
}
}
]
}
}
}
But you decide that you want the title field to be more important than the body field, which means you need to boost the query on the title field by (eg) 2:
{
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "Quick Brown Fox",
"boost": 2
}
}
},
{
"match": {
"body": "Quick Brown Fox"
}
}
]
}
}
}
(Note how the structure of the match clause changed to accommodate the boost parameter).
The boost value of 2 doesn't double the _score exactly: the scores go through a normalization process. So you should think of boost as "make this query clause relatively more important than the other query clauses".
My doubt is if i use boost values in some queries. will it affect final score of search
Yes it does, but you shouldn't rely on the actual value of _score anyway. Its only purpose is to allow Elasticsearch to decide which documents are most relevant to this query. If the query changes, the scores change.
Re index time boosting: don't use it. It's inflexible and error prone.
Boost at query time won't modify your index. It only applies boost factor on fields when searching.
I prefer boost at query time as it's more flexible. If you need to change your boost rules and you had set it at index time, you will probably need to reindex.
Use cases of boosting: suppose you are building an e-commerce web app, and your product data is in Elasticsearch. Whenever a customer uses the search bar, you query Elasticsearch and display the results in the web app.
Elasticsearch computes a relevance score for every matching document and returns the results sorted by that score.
Now let's assume a user searches for "samsung phones". Should your web app show only samsung phones? The answer is no.
Your web app should show other phones as well (as the user may like those too), but show samsung phones first (as he/she is looking for those).
So the question is: how do you query so that samsung phones come up first in the results? The answer is the relevance score.
Let's say you query for all mobile phones and boost the samsung phones so that they get a higher relevance score.
Then the results will contain samsung phones first and then other phones.
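One way this could look, as a rough Python sketch that builds the request body: every phone must match, and a boosted should clause lifts the samsung ones. The field names `category` and `brand` and the boost value of 2.0 are assumptions for illustration, not taken from a real product index.

```python
import json


def build_phone_query(category="phones", brand="samsung", brand_boost=2.0):
    """All phones match via must; the boosted should clause ranks the brand first.

    Field names and the boost value are hypothetical.
    """
    return {
        "query": {
            "bool": {
                # Required: restricts results to phones of any brand.
                "must": [{"match": {"category": category}}],
                # Optional: adds to the score of documents from the given brand.
                "should": [
                    {"match": {"brand": {"query": brand, "boost": brand_boost}}}
                ],
            }
        }
    }


print(json.dumps(build_phone_query(), indent=2))
```

Because the should clause is optional, other phones still appear in the results; they simply score lower than the ones that also match the boosted clause.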

Resources