Elasticsearch: How can I influence the "directionality" of a trigram match?

We use Elasticsearch to search address data. To support non-exact matches, we index a variant of the street-name field that is analyzed with an ngram tokenizer (trigrams, to be specific). Queries on this field use a minimum_should_match clause of "3<75%", which means: if the search term yields 3 or fewer trigrams, all of them have to match; if it yields more than 3, then 75% of them have to match.
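As a rough illustration of how the "3<75%" rule translates into a required number of matching trigrams, here is a minimal Python sketch. The function name `required_matches` is made up for this example, it only handles the single "N<P%" form used above, and it assumes the percentage result is rounded down, as the Elasticsearch documentation describes for positive percentages.

```python
def required_matches(clause_count, spec="3<75%"):
    """Number of clauses that must match for a spec of the form 'N<P%'.

    Hypothetical helper for illustration only; it does not cover the other
    minimum_should_match formats (negative values, combinations, ...).
    """
    threshold, percent = spec.split("<")
    threshold = int(threshold)
    percent = int(percent.rstrip("%"))
    if clause_count <= threshold:
        # At or below the threshold every clause is required.
        return clause_count
    # Above the threshold the percentage applies, rounded down.
    return clause_count * percent // 100


for n in (2, 3, 4, 13):
    print(n, "trigrams ->", required_matches(n), "must match")
```

For a 4-trigram query this yields 3 required matches, and for a 13-trigram query it yields 9.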
Generally this works OK, but there are cases where we get unintended results like this:
We search for "Uhland" and we find "Am Maschlandgraben". As far as I can tell, "Uhland" is split into "uhl", "hla", "lan", "and", and 3 of those 4 trigrams can be matched to the trigrams of "Am MascHLANDgraben" (the matching part in upper case). 3 out of 4 is 75%, which fulfills our "3<75%" requirement, so it becomes a match.
So there is a "directionality" (for lack of a better word) to that 75% match: it only counts against the number of trigrams in the search term and ignores how many trigrams of the indexed document are not matched.
One could argue that the 75% match requirement is not met in that example, because 10 out of the 13 trigrams from "Am Maschlandgraben" are not matched by the trigrams of "Uhland". And in fact, if you reverse the query and search for "Am Maschlandgraben", you won't find "Uhland" as a match, because now the "directionality" is reversed: only 3 out of 13 query trigrams are matched, which does not meet the "3<75%" requirement.
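The counts in this example can be reproduced with a small Python sketch. It assumes the trigram analyzer lowercases the text and only builds trigrams inside runs of letters (so "Am" contributes nothing), which is consistent with the 13 trigrams mentioned above; the real index settings may differ.

```python
def trigrams(text):
    """Set of 3-letter windows per word, lowercased.

    Assumption: non-letters act as separators, similar to an ngram
    tokenizer restricted to letter characters.
    """
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return {w[i:i + 3] for w in cleaned.split() for i in range(len(w) - 2)}


query = trigrams("Uhland")
doc = trigrams("Am Maschlandgraben")
shared = query & doc

print(sorted(shared))                             # ['and', 'hla', 'lan']
print(len(shared), "of", len(query), "query trigrams match")    # 3 of 4
print(len(shared), "of", len(doc), "document trigrams match")   # 3 of 13
```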
What I would love to figure out is how to modify the query so that the 75% match has no "directionality" and always has to hold on "both sides" of the comparison. To stay with the example above, I want neither "Uhland" to match "Am Maschlandgraben" nor "Am Maschlandgraben" to match "Uhland".
So, to put it in real-life language: instead of "75% of the search term trigrams need to match the indexed document", I would like to have "75% of both the search term trigrams and the indexed document trigrams need to match".
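To make the desired rule concrete, here is a sketch of the symmetric check in plain Python. This is not an Elasticsearch query, only an executable statement of the requirement; `symmetric_match` and its `ratio` parameter are hypothetical names, the "3 or fewer" part of the rule is left out for brevity, and the trigram splitting is the same assumption as before (lowercased, letters only).

```python
def trigrams(text):
    """Lowercased 3-letter windows inside each run of letters (assumed analysis)."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return {w[i:i + 3] for w in cleaned.split() for i in range(len(w) - 2)}


def symmetric_match(a, b, ratio=0.75):
    """True only if the shared trigrams cover `ratio` of BOTH trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return False
    shared = len(ta & tb)
    return shared >= ratio * len(ta) and shared >= ratio * len(tb)


print(symmetric_match("Uhland", "Am Maschlandgraben"))  # False
print(symmetric_match("Am Maschlandgraben", "Uhland"))  # False
print(symmetric_match("Uhland", "Uhlant"))              # True
```

With this rule the result no longer depends on which string is the search term, and a near miss such as "Uhlant" still matches because 3 of its 4 trigrams are shared.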
I hope I communicated my intention well enough (English is not my native language).
Here is an example of how our query looks right now:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "address.street.trigram": {
              "query": "Uhland",
              "minimum_should_match": "3<75%"
            }
          }
        }
      ]
    }
  }
}

Related

Using NOT operator to exclude compound words in elasticsearch

I've a problem with the NOT operator in Elasticsearch.
I start a query string query and look for these keywords:
plain~ AND NOT port~
I'm getting a list of documents which contain the word "plain" (that's OK) but also ones with the word "airport".
Is this the correct behavior and how can I exclude these compound words?
Yes, this is correct behaviour. Please have a look at the documentation for the fuzzy operator and especially the fuzziness parameter values.
The point is that fuzzy operator "uses the Damerau-Levenshtein distance to find all terms with a maximum of two changes, where a change is the insertion, deletion or substitution of a single character..."
The word "airport" is not excluded by your query because it is more than two changes away from "port" (three insertions are needed).
But this query would work:
{
  "query": {
    "query_string": {
      "fields": ["description"],
      "query": "NOT rport~2"
    }
  }
}
It would exclude airport from the results. But you cannot increase the fuzziness factor to 3 as this is not supported (so this "query": "NOT port~3" won't work).
Your needs sound to me more like one of the cases of Partial Matching.
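The edit distances behind this behaviour can be checked with a short sketch. The function below is an assumption-laden stand-in: it implements the optimal string alignment variant of Damerau-Levenshtein (insert, delete, substitute, swap of adjacent characters) and is not Lucene's actual implementation.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one change.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("port", "airport"))   # 3, outside the limit of 2
print(edit_distance("rport", "airport"))  # 2, within the limit
```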

Match string with minus character in elasticsearch

So in DB I have this entry:
Mark-Whalberg
When searching with term
Mark-Whalberg
I get no match.
Why? Is the minus sign a special character? As I understand it, it symbolizes "exclude"?
The query is this:
{"query_string": {"query": 'Mark-Whalberg', "default_operator": "AND"}}
Searching everything else, like:
Mark
Whalberg
hlb
Mark Whalberg
returns a match.
Is this stored as two different pieces? How can I get a match when including the minus sign in the search term?
--------------EDIT--------------
This is the current query:
var fields = [
  "field1",
  "field2",
];
{"query_string": {"query": '*Mark-Whalberg*', "default_operator": "AND", "fields": fields}};
You have an analyzer configuration issue.
Let me explain. When you defined your index in Elasticsearch, you didn't specify an analyzer for the field, which means the Standard Analyzer is applied.
According to the documentation:
Standard Analyzer
The standard analyzer is the default analyzer which is used if none is
specified. It provides grammar based tokenization (based on the
Unicode Text Segmentation algorithm, as specified in Unicode Standard
Annex #29) and works well for most languages.
Also, to answer your question:
Why? Is the minus sign a special character? As I understand it, it
symbolizes "exclude"?
For the Standard Analyzer, yes it is. It doesn't mean "exclude", but it is a special character that is dropped during analysis.
From the documentation:
Why doesn’t the term query match my document?
[...] There are many ways to analyze text: the default standard
analyzer drops most punctuation, breaks up text into individual words,
and lower cases them. For instance, the standard analyzer would turn
the string “Quick Brown Fox!” into the terms [quick, brown, fox].
[...]
Example:
If you have the following text:
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
Then the Standard Analyzer will produce:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
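To see the effect on the hyphenated name from the question, here is a rough Python approximation. It is only a regex stand-in, not the real Standard Analyzer (which follows Unicode Text Segmentation): it lowercases, keeps runs of letters and digits, and allows apostrophes inside a word. The function name `approx_standard_analyze` is made up for this sketch.

```python
import re


def approx_standard_analyze(text):
    """Very rough imitation of the standard analyzer for simple English text."""
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())


print(approx_standard_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
print(approx_standard_analyze("Mark-Whalberg"))  # ['mark', 'whalberg']
```

Under this approximation the indexed value "Mark-Whalberg" is stored as two separate terms, and the hyphen itself never reaches the index.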
If you don't want the analyzer to get in your way, you have two options:
You can use a match query, which runs the search text through the same analyzer as the field, so both sides produce the same tokens.
You can ask Elasticsearch not to analyze the field when you create your index.
I hope this helps.
I was stuck on the same question, and the answer from @Mickael was perfect for understanding what is going on (I really recommend reading the linked documentation).
I solved this by specifying an operator in the query:
GET http://localhost:9200/creative/_search
{
  "query": {
    "match": {
      "keyword_id": {
        "query": "fake-keyword-uuid-3",
        "operator": "AND"
      }
    }
  }
}
To better understand the algorithm that this query uses, try adding "explain": true and analyze the results:
GET http://localhost:9200/creative/_search
{
  "explain": true,
  "query": // ...
}

Fuzziness on 3 letter words

What kind of analyzers would you implement in Elasticsearch for searching book titles?
The requirements are that there must be fuzziness and that some words are only 3 letters long.
I'm not going to include code because I would like to get a fresh insight.
The problem I am having shows up when I misspell a 3-letter word.
Say I type "dns" and there is a document with a field "dna"; then I will get
"kindness" or something else that has "dns" in the word.
I believe you can solve your problem with the fuzziness field of a fuzzy query. It lets you set the maximum edit distance, so long words will not be matched when your input is a very short word.
{
  "fuzzy": {
    "user": {
      "value": "ki",
      "fuzziness": 2,
      "prefix_length": 1
    }
  }
}
The above query would match, among others, all 3-letter words which start with the letter 'k' and all 4-letter words which start with the letters 'ki'. A fuzziness of 2 means that any 2 edits are allowed, i.e. either change 'i' to another letter and then add another letter, or add two more letters while keeping 'ki'. The prefix length tells Elasticsearch how much of the query needs to be matched exactly before the fuzziness can take over.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html
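The interplay of fuzziness and prefix_length can be sketched in a few lines of Python. This is a simplified model, not the Lucene implementation: it uses plain Levenshtein distance (no transpositions), and `fuzzy_matches` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_matches(value, candidate, fuzziness=2, prefix_length=1):
    """Candidate must share the exact prefix and stay within the edit budget."""
    if value[:prefix_length] != candidate[:prefix_length]:
        return False
    return levenshtein(value, candidate) <= fuzziness


print(fuzzy_matches("ki", "kid"))    # True, one insertion
print(fuzzy_matches("ki", "kind"))   # True, two insertions
print(fuzzy_matches("ki", "hi"))     # False, first letter must match
print(fuzzy_matches("dns", "dna"))   # True, one substitution
print(fuzzy_matches("dns", "kindness", prefix_length=0))  # False, too many edits
```

The last two lines show the case from the question: with an edit-distance limit, "dna" is found for "dns" while "kindness" is not.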

Elasticsearch term query with colons

I have a string field "title" (not analyzed) in Elasticsearch. A document has the title "Garfield 2: A Tail Of Two Kitties (2006)".
When I use the following JSON to query, no results are returned.
{"query":{"term":{"title":"Garfield 2: A Tail Of Two Kitties (2006)"}}}
I tried to escape the colon character and the parentheses, like:
{"query":{"term":{"title":"Garfield 2\\: A Tail Of Two Kitties \\(2006\\)"}}}
Still not working.
A term query won't tokenize or apply analyzers to the search text. Instead it looks for an exact match, which won't work here because string fields are analyzed/tokenized by default.
To explain this better:
Let's say there is a string value "I am in summer:camp".
When indexing, it is broken into tokens as below:
"I am in summer:camp" => [ I , am , in , summer , camp ]
Hence even if you do a term search for "I am in summer:camp", it still won't work, because the token "I am in summer:camp" is not present in the index.
Something like a phrase query might work better here.
Or you can set the "index" option of the field to "not_analyzed" to make sure that the string is not tokenized.
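The difference between the two cases can be modelled with a toy index in Python. Everything here is a simplification for illustration: `analyze` is a crude regex stand-in for the real analyzer, and `index_value` and `term_query` are hypothetical helpers, not Elasticsearch APIs.

```python
import re


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[^\W_]+", text.lower())


def index_value(value, analyzed=True):
    """Terms stored in the index for one field value."""
    return set(analyze(value)) if analyzed else {value}


def term_query(indexed_terms, term):
    """A term query looks up the search text verbatim, without analysis."""
    return term in indexed_terms


title = "Garfield 2: A Tail Of Two Kitties (2006)"

print(term_query(index_value(title), title))                  # False
print(term_query(index_value(title), "garfield"))             # True
print(term_query(index_value(title, analyzed=False), title))  # True
```

In the analyzed case only the individual tokens exist in the index, so the full title can never be found as a single term; in the not-analyzed case the whole string is the one and only term.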

Elasticsearch substring matching without ending

For example, if my search word is "Houses", I want to find the result "House". How can I search without the last 1-2 letters of the word?
I tried the "nGram" filter, but it searches for the full word.
I feel you are chasing the wrong approach.
Judging by your example, I think what you are looking for is a stemmer.
Elasticsearch has stemmers like Snowball which can convert words to their base forms or stems.
For example, the stemmer can convert
[ "jumping" , "jumped" ] -> "jump"
[ "staying" , "stayed" ] -> "stay"
And so on...
Snowball - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html#analysis-snowball-analyzer
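The idea behind stemming can be shown with a deliberately naive Python sketch. This is a toy suffix stripper with made-up rules, not the Snowball algorithm; it only demonstrates that the query word and the indexed word end up as the same term once both are reduced to a stem.

```python
def toy_stem(word):
    """Strip one common English suffix; a toy, NOT the Snowball stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


print(toy_stem("jumping"), toy_stem("jumped"))  # jump jump
print(toy_stem("staying"), toy_stem("stayed"))  # stay stay
print(toy_stem("Houses"), toy_stem("House"))    # hous hous
```

Because "Houses" and "House" reduce to the same stem, a search for one finds documents containing the other, provided the same stemming is applied at index time and at query time.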
