Elasticsearch fuzziness with an edit distance greater than 2 characters

I am trying to match text fields, and I expect a result whenever at least 60% of the text matches.
Fuzziness only allows an edit distance of up to 2, which is not enough here.
The Elasticsearch index has a record with the description 'theeventsfooddrinks'. When I try to match 'theeventsfooddrinks123', it does not match.
'theeventsfooddrinks12' => matches
'theeventsfooddri' => does not match
'321eventsfooddrinks' => does not match
I also want Elasticsearch to match 'eventsfooddrinks'.
Any change that requires more than 2 edits does not match.

I think fuzzy queries are inappropriate for your case. Fuzziness is a way to handle the small misspellings that a person makes while typing a query. The human brain can easily skip over a substituted letter in the middle of a word without losing the overall meaning of the phrase, and we expect similar behavior from a search engine.
Try regular partial matching with an ngram analyzer instead:
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trigrams_filter"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "trigrams"
}
}
}
}
}
GET my_index/my_type/_search
{
"query": {
"match": {
"my_field": {
"query": "eventsfooddrinks",
"minimum_should_match": "60%"
}
}
}
}
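To see why the trigram approach tolerates longer insertions and deletions, here is a rough Python sketch of the matching logic. It is only a simplified model, not how Elasticsearch actually scores documents: the helper names trigrams and would_match are made up for illustration, and it assumes that the whole description is a single word and that the percentage is rounded down.

```python
def trigrams(text):
    """Split a lowercased string into overlapping 3-character tokens."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def would_match(indexed, query, min_percent=60):
    """Simplified stand-in for a match query with minimum_should_match."""
    indexed_tokens = set(trigrams(indexed))
    query_tokens = trigrams(query)
    # Assumption: the required number of matching tokens is rounded down.
    required = len(query_tokens) * min_percent // 100
    found = sum(1 for token in query_tokens if token in indexed_tokens)
    return found >= required


doc = "theeventsfooddrinks"
for query in ["eventsfooddrinks", "theeventsfooddrinks123",
              "321eventsfooddrinks", "theeventsfooddri"]:
    print(query, would_match(doc, query))
```

In this model every variant from the question clears the 60% threshold, because a few extra or missing characters only affect the trigrams that touch them.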

Related

Elasticsearch query returning false results when term exceeds ngram length

The requirement is to search partial phrases in a block of text. Most of the words will be standard length. I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and these show up if I type in a query where the first 10 characters match, but then the rest don't.
For example, here is the mapping:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
}
}
and document:
POST my_index/doc/1
{
"title": "Quick fox with id of ABCDEFGHIJKLMNOP"
}
If I run the query:
POST my_index/doc/_search
{
"query": {
"match_phrase": {
"title": {
"query": "fox wi"
}
}
}
}
It returns the document as expected. However, if I run this:
POST my_index/doc/_search
{
"query": {
"match_phrase": {
"title": {
"query": "ABCDEFGHIJxxx"
}
}
}
}
It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?
I am using version 5.
By default, the analyzer that is used at index time is the same analyzer that is used at search time, meaning the edge_ngram analyzer is used on your search term. This is not what you want. You will end up with 10 tokens as the search terms, none of which contain those last 3 characters.
You will want to take a look at the Search Analyzer for your mapping. This documentation points out this specific use case:
Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.
The standard analyzer may suit your needs:
{
...
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
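The effect of the search analyzer can be illustrated with a small Python sketch. It only imitates what the edge_ngram tokenizer from the question emits for a single word (the function name edge_ngrams is hypothetical), and it ignores token positions and phrase matching.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Imitate edge_ngram plus lowercase for one word: prefixes only."""
    word = word.lower()
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


indexed = set(edge_ngrams("ABCDEFGHIJKLMNOP"))

# Same analyzer at search time: the query collapses to the same prefixes,
# so the trailing "xxx" never takes part in the index lookup.
query_tokens = edge_ngrams("ABCDEFGHIJxxx")
print(all(token in indexed for token in query_tokens))

# Standard analyzer at search time: the query stays one whole token,
# and that token was never indexed.
print("abcdefghijxxx" in indexed)
```

A side effect of this setup is that a correct search for the full id (all 16 characters) also finds nothing, since the longest indexed prefix has 10 characters.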

ElasticSearch does not respect Max NGram length while using NGram Tokenizer

I am using the ngram tokenizer and have set min_gram to 3 and max_gram to 5. However, even when I search for a word longer than 5 characters, I still get the result. This is strange, because ES will not index a token of length 6, yet I am still able to retrieve the record. Is there some theory I am missing here? If not, what significance does the max_gram of the ngram tokenizer really have? This is the mapping I tried:
PUT ngramtest
{
"mappings": {
"MyEntity":{
"properties": {
"testField":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5
}
}
}
}
}
Indexed a test entity as:
PUT ngramtest/MyEntity/123
{
"testField":"Z/16/000681"
}
And this query, strangely, returns results:
GET ngramtest/MyEntity/_search
{
"query": {
"match": {
"testField": "000681"
}
}
}
I have tried this for 'analyzing' the string:
POST ngramtest/_analyze
{
"analyzer": "my_analyzer",
"text": "Z/16/000681."
}
Can someone please correct me if I am going wrong?
The reason is that your analyzer my_analyzer is used for both indexing and searching. Hence, when you search for a 6-character word such as abcdef, that word is also analyzed by your ngram analyzer at search time and produces the tokens abc, abcd, abcde, bcd, etc., and those match the indexed tokens.
What you need to do is specify the standard analyzer as the search_analyzer in your mapping:
"testField":{
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
Before wiping your index and repopulating it, you can test this theory simply by specifying the search analyzer to use in your match query:
GET ngramtest/MyEntity/_search
{
"query": {
"match": {
"testField": {
"query": "000681",
"analyzer": "standard"
}
}
}
}
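The same reasoning can be checked with a short Python sketch. It is a simplification that assumes the ngram tokenizer emits every substring of 3 to 5 characters of the input; the function name ngrams is hypothetical.

```python
def ngrams(text, min_gram=3, max_gram=5):
    """Imitate the ngram tokenizer: every substring of 3 to 5 characters."""
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}


indexed = ngrams("Z/16/000681")

# Search analyzed with my_analyzer: the query is cut into short grams,
# and each of them exists in the index.
print(ngrams("000681") <= indexed)

# Search analyzed with the standard analyzer: one 6-character token,
# which was never indexed because max_gram is 5.
print("000681" in indexed)
```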

Why does a match_phrase_prefix query return wrong results with different phrase lengths?

I have very simple query:
POST /indexX/document/_search
{
"query": {
"match_phrase_prefix": {
"surname": "grab"
}
}
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and definition for index (I use Stempel (Polish) Analysis for Elasticsearch plugin):
POST /indexX
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym" : {
"type": "synonym",
"synonyms_path": "analysis/synonyms.txt"
},
"polish_stop": {
"type": "stop",
"stopwords_path": "analysis/stopwords.txt"
},
"polish_my_stem": {
"type": "stemmer",
"rules_path": "analysis/stems.txt"
}
},
"analyzer": {
"polish_with_synonym": {
"tokenizer": "standard",
"filter": [
"synonym",
"lowercase",
"polish_stop",
"polish_stem",
"polish_my_stem"
]
}
}
}
}
}
}
For this query I get zero results. When I change phrase to GRA or GRABA it returns 1 result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values even as high as 1200 and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
Without going into details on how to resolve this, please consider getting rid of the polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I have seen different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess that in the case of Polish names, some kind of prefix query on a non-analyzed field would suffice...

How do I configure Elasticsearch to find substrings at the beginning OR at the end of a word (but not in middle)?

I'm starting to learn Elasticsearch and now I am trying to write my first analyser configuration. What I want to achieve is that substrings are found if they are at the beginning or ending of a word. If I have the word "stackoverflow" and I search for "stack" I want to find it and when I search for "flow" I want to find it, but I do not want to find it when searching for "ackov" (in my use case this would not make sense).
I know there is the "Edge n gram tokenizer", but one analyser can only have one tokenizer and the edge n-gram can either be front or back (but not both at the same time).
If I understood correctly, applying both versions of the "Edge ngram filter" (front and back) to the same analyzer would find neither, because both filters need to return true. Is that right? "stack" is not at the end of the word, so the back edge n-gram filter would return false and the word "stackoverflow" would not be found.
So, how do I configure my analyzer to find substrings either in the end or in the beginning of a word, but not in the middle?
What you can do is define two analyzers, one for matching at the start of a string and another for matching at the end. In the index settings below, I named the former prefix_edge_ngram_analyzer and the latter suffix_edge_ngram_analyzer. The two analyzers are applied to the sub-fields of a multi-field string field: text.prefix and text.suffix, respectively.
{
"settings": {
"analysis": {
"analyzer": {
"prefix_edge_ngram_analyzer": {
"tokenizer": "prefix_edge_ngram_tokenizer",
"filter": ["lowercase"]
},
"suffix_edge_ngram_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase","reverse","suffix_edge_ngram_filter","reverse"]
}
},
"tokenizer": {
"prefix_edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
},
"filter": {
"suffix_edge_ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"test_type": {
"properties": {
"text": {
"type": "string",
"fields": {
"prefix": {
"type": "string",
"analyzer": "prefix_edge_ngram_analyzer"
},
"suffix": {
"type": "string",
"analyzer": "suffix_edge_ngram_analyzer"
}
}
}
}
}
}
}
Then let's say we index the following test document:
PUT test_index/test_type/1
{ "text": "stackoverflow" }
We can then search either by prefix or suffix using the following queries:
# input is "stack" => 1 result
GET test_index/test_type/_search?q=text.prefix:stack OR text.suffix:stack
# input is "flow" => 1 result
GET test_index/test_type/_search?q=text.prefix:flow OR text.suffix:flow
# input is "ackov" => 0 result
GET test_index/test_type/_search?q=text.prefix:ackov OR text.suffix:ackov
Another way to query with the query DSL:
POST test_index/test_type/_search
{
"query": {
"multi_match": {
"query": "stack",
"fields": [ "text.*" ]
}
}
}
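To make the reverse, edge n-gram, reverse trick in the suffix analyzer more concrete, here is a minimal Python sketch of the tokens each sub-field would hold. The function names are hypothetical, and the lookup treats the search term as one whole token, which is a simplification of what the queries above do.

```python
def prefix_tokens(word, min_gram=2, max_gram=25):
    """Imitate prefix_edge_ngram_analyzer: lowercase, then leading grams."""
    word = word.lower()
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


def suffix_tokens(word, min_gram=2, max_gram=25):
    """Imitate suffix_edge_ngram_analyzer: reverse, edge n-gram, reverse."""
    reversed_word = word.lower()[::-1]
    # Leading grams of the reversed word are trailing grams of the original.
    grams = prefix_tokens(reversed_word, min_gram, max_gram)
    return [gram[::-1] for gram in grams]


def found(word, term):
    """Treat the search term as one token looked up in either sub-field."""
    term = term.lower()
    return term in prefix_tokens(word) or term in suffix_tokens(word)


for term in ["stack", "flow", "ackov"]:
    print(term, found("stackoverflow", term))
```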
UPDATE
If you already have a string field, you can "upgrade" it to a multi-field and create the two required sub-fields with their analyzers. Perform the following steps in order:
Close your index in order to create the analyzers
POST test_index/_close
Update the index settings
PUT test_index/_settings
{
"analysis": {
"analyzer": {
"prefix_edge_ngram_analyzer": {
"tokenizer": "prefix_edge_ngram_tokenizer",
"filter": ["lowercase"]
},
"suffix_edge_ngram_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase","reverse","suffix_edge_ngram_filter","reverse"]
}
},
"tokenizer": {
"prefix_edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
},
"filter": {
"suffix_edge_ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 25
}
}
}
}
Re-open your index
POST test_index/_open
Finally, update the mapping of your text field
PUT test_index/_mapping/test_type
{
"properties": {
"text": {
"type": "string",
"fields": {
"prefix": {
"type": "string",
"analyzer": "prefix_edge_ngram_analyzer"
},
"suffix": {
"type": "string",
"analyzer": "suffix_edge_ngram_analyzer"
}
}
}
}
}
You still need to re-index all your documents in order for the new sub-fields text.prefix and text.suffix to be populated and analyzed.

Ignore leading zeros with Elasticsearch

I am trying to create a search bar where the most common query will be for a "serviceOrderNo". "serviceOrderNo" is not a number field in the database, it is a string field. Examples:
000000007
000000002
WO0000042
123456789
AllTextss
000000054
000000065
000000874
The most common format is just an integer preceded by some number of zeros.
How do I set up Elasticsearch so that searching for "65" will match "000000065"? I also want to give precedence to the "serviceOrderNo" field (which I already have working). Here is where I am at right now:
{
"query": {
"multi_match": {
"query": "65",
"fields": ["serviceOrderNo^2", "_all"]
}
}
}
One way of doing this is to use the Lucene-flavoured regular expression query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
"query": {
"regexp":{
"serviceOrderNo": "[0]*65"
}
}
The query string query supports wildcards as well as Lucene regular expressions, which have to be wrapped in forward slashes. The query would look like this:
https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-query-string-query.html
"query": {
"query_string": {
"default_field": "serviceOrderNo",
"query": "/0*65/"
}
}
These are fairly simple regular expressions. Both say: match the character in the brackets ([0]), or the character 0, any number of times (*), followed by 65.
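As a quick sanity check of the pattern itself, here is a small Python sketch. It assumes that Python's re.fullmatch is a fair stand-in for Lucene's behavior of matching a regular expression against the whole term; the function name matches_order_no is made up for illustration.

```python
import re

# Any number of leading zeros followed by the digits the user typed.
LEADING_ZEROS_65 = re.compile(r"0*65")


def matches_order_no(term):
    """Return True if the whole term is zeros followed by 65."""
    return LEADING_ZEROS_65.fullmatch(term) is not None


for term in ["000000065", "65", "000000654", "WO0000065"]:
    print(term, matches_order_no(term))
```

Only the first two terms match in this sketch, since the pattern has to cover the entire term.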
If you have the ability to reindex, or haven't indexed your data yet, you can also make this easier on yourself by writing a custom analyzer. Right now, you are using the default analyzer for strings on your serviceOrderNo field. When you index "serviceOrderNo":"000000065", ES interprets this simply as the single token 000000065.
Your custom analyzer could tokenize this field into both "000000065" and "65", using the same regular expression. The benefit is that the regex runs only once, at index time, instead of every time you run your query, because ES will search against both "000000065" and "65".
You can also check out the ES website documentation on Analyzers.
"settings":{
"analysis": {
"filter":{
"trimZero": {
"type":"pattern_capture",
"patterns":"^0*([0-9]*$)"
}
},
"analyzer": {
"serviceOrderNo":{
"type":"custom",
"tokenizer":"standard",
"filter":"trimZero"
}
}
}
},
"mappings":{
"serviceorderdto": {
"properties":{
"serviceOrderNo":{
"type":"string",
"analyzer":"serviceOrderNo"
}
}
}
}
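What the trimZero filter is meant to emit can be sketched in Python with the same pattern. This is a rough approximation under stated assumptions: it keeps the original token alongside the captured group (which I believe is the filter's default behavior), and it skips empty or duplicate captures as a simplification. The function name trim_zero_tokens is hypothetical.

```python
import re

# Same pattern as in the trimZero filter above.
TRIM_ZERO = re.compile(r"^0*([0-9]*$)")


def trim_zero_tokens(token):
    """Return the original token plus its zero-stripped form, if any."""
    tokens = [token]
    match = TRIM_ZERO.search(token)
    # Simplification: ignore empty captures and captures equal to the input.
    if match and match.group(1) and match.group(1) != token:
        tokens.append(match.group(1))
    return tokens


for value in ["000000065", "WO0000042", "123456789"]:
    print(value, trim_zero_tokens(value))
```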
One way to do this is to use an ngram token filter so that "12345" gets tokenized as:
[ 1, 2, 3, 4, 5 ]
[ 12, 23, 34, 45 ]
[ 123, 234, 345 ]
[ 1234, 2345 ]
[ 12345 ]
When tokenized this way, "65" is a match for "000000065".
To set this up, create a new index that has a custom analyzer that uses an ngram filter:
POST /my-index
{
"mappings": {
"serviceorderdto": {
"properties": {
"serviceOrderNo": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Index some data.
Then run your query:
GET /my-index/_search
{
"query": {
"multi_match": {
"query": "65",
"fields": [
"serviceOrderNo^2",
"_all"
]
}
}
}

Resources