Why does a match_phrase_prefix query return wrong results for different phrase lengths? - elasticsearch

I have a very simple query:
POST /indexX/document/_search
{
"query": {
"match_phrase_prefix": {
"surname": "grab"
}
}
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and this definition for the index (I use the Stempel (Polish) Analysis for Elasticsearch plugin):
POST /indexX
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym" : {
"type": "synonym",
"synonyms_path": "analysis/synonyms.txt"
},
"polish_stop": {
"type": "stop",
"stopwords_path": "analysis/stopwords.txt"
},
"polish_my_stem": {
"type": "stemmer",
"rules_path": "analysis/stems.txt"
}
},
"analyzer": {
"polish_with_synonym": {
"tokenizer": "standard",
"filter": [
"synonym",
"lowercase",
"polish_stop",
"polish_stem",
"polish_my_stem"
]
}
}
}
}
}
}
For this query I get zero results. When I change the phrase to GRA or GRABA, it returns 1 result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values even as high as 1200 and that didn't help.
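For reference, this is roughly how max_expansions was passed (a sketch of the request shape, not the exact call):
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": {
        "query": "grab",
        "max_expansions": 1200
      }
    }
  }
}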

At first glance, your analyzer stems the search term ("grab") into something unusable ("grabić").
Without going into details on how to resolve this, please consider getting rid of the Polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I saw different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess in the case of Polish names some kind of prefix query on a non-analyzed field would suffice...
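For illustration only, a minimal sketch of that idea: a not_analyzed subfield next to the existing mapping, queried with a prefix query (the subfield name is made up). Note that prefix queries are not analyzed, so the query term must match the stored casing; alternatively, lowercase at index time with a keyword-tokenizer analyzer and send lowercase terms.
"surname": {
  "type": "string",
  "analyzer": "polish",
  "copy_to": ["full_name"],
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

POST /indexX/document/_search
{
  "query": {
    "prefix": {
      "surname.raw": "GRAB"
    }
  }
}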

Related

elasticsearch n-gram tokenizer with match_phrase not giving expected result

I created an index as follows:
PUT /ngram_tokenizer
{
"mappings": {
"properties": {
"test_name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"index": {
"max_ngram_diff": 20
},
"analysis": {
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 20,
"token_chars":[
"letter",
"digit",
"whitespace",
"symbol"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
}
}
}
}
Then I indexed documents as follows:
POST /ngram_tokenizer/_doc
{
"test_name": "test document"
}
POST /ngram_tokenizer/_doc
{
"test_name": "another document"
}
Then I ran a match_phrase query:
GET /ngram_tokenizer/_search
{
"query": {
"match_phrase": {
"test_name": "document"
}
}
}
The above query returns both documents as expected, but the query below didn't return any documents:
GET /ngram_tokenizer/_search
{
"query": {
"match_phrase": {
"test_name": "test"
}
}
}
I also checked all the tokens it generates with the following query:
POST ngram_tokenizer/_analyze
{
"analyzer": "my_analyzer",
"text": "test document"
}
The match query works fine. Can you help me?
Update
When I want to search for a phrase, I have to do a match_phrase query, right? I used the n-gram tokenizer on that field so that, if there is a typo in the search term, I can still get a similar doc. Also, I know that we can use fuzziness to overcome typo issues in search terms, but when I used fuzziness in match queries or fuzzy queries there was a scoring issue, as mentioned here. What I actually want is: when I do a match query, I want to get results even if there is a typo in the search terms, and in the match_phrase query I should get proper results at least when I search without any typos.
It's because at search time, the analyzer used to analyze the input text is the same as the one used at indexing time, i.e. my_analyzer, and the match_phrase query is a bit more complex than the match query.
At search time, you should simply use the standard analyzer (or something different from the ngram analyzer) in order to analyze your query input.
The following query shows how to make it work as you expect.
GET /ngram_tokenizer/_search
{
"query": {
"match_phrase": {
"test_name": {
"query": "test",
"analyzer": "standard"
}
}
}
}
You can also specify the standard analyzer as a search_analyzer in your mapping
"test_name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}

Custom stopword analyzer is not working properly

I have created an index with a custom analyzer for stop words. I want Elasticsearch to ignore these words at search time. Then I added one document to the index.
But when I query in Kibana for the keyword "the", it should not show any successful match, because in my_analyzer I have put "the" in the my_stop_word section. Yet it shows a match. I have read that if you specify an analyzer for a field in the mapping at index time, then that analyzer is also used by default at query time.
Please help!
PUT /pandey
{
"settings":
{
"analysis":
{
"analyzer":
{
"my_analyzer":
{
"tokenizer": "standard",
"filter": [
"my_stemmer",
"english_stop",
"my_stop_word",
"lowercase"
]
}
},
"filter": {
"my_stemmer": {
"type": "stemmer",
"name": "english"
},
"english_stop":{
"type": "stop",
"stopwords": "_english_"
},
"my_stop_word": {
"type": "stop",
"stopwords": ["robot", "love", "affection", "play", "the"]
}
}
}
},
"mappings": {
"properties": {
"dialog": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
PUT pandey/_doc/1
{
"dailog" : "the boy is a robot. he is in love. i play cricket"
}
GET pandey/_search
{
"query": {
"match": {
"dailog": "the"
}
}
}
A small spelling mistake can lead to this.
You defined the mapping for dialog but added the document with the field name dailog. The dynamic field mapping behavior of Elasticsearch will index it without error (we can disable it, though).
So the query "dailog": "the" gets a result using the default analyzer, not your custom one.
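A minimal sketch of the fix: re-index the document under the mapped field name so that my_analyzer (and its stop words) actually applies; the same search then returns no hits, because "the" is stripped at both index and search time.
PUT pandey/_doc/1
{
  "dialog": "the boy is a robot. he is in love. i play cricket"
}

GET pandey/_search
{
  "query": {
    "match": {
      "dialog": "the"
    }
  }
}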

Elastic search with fuzziness more than 2 characters (Distance)

I am trying to match text fields. I expect results if there is a 60%-plus match.
With fuzziness we can allow an edit distance of only 2.
The index has a record with the description 'theeventsfooddrinks' and I am trying to match 'theeventsfooddrinks123'. This doesn't match.
'theeventsfooddrinks12' => matches
'theeventsfooddri' => doesn't match
'321eventsfooddrinks' => doesn't match
I want Elasticsearch to match 'eventsfooddrinks' too.
Any change requiring more than 2 edits does not match.
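Roughly the kind of query I am running (index and field names here are just for illustration):
GET my_index/_search
{
  "query": {
    "match": {
      "description": {
        "query": "theeventsfooddrinks123",
        "fuzziness": 2
      }
    }
  }
}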
I think fuzzy queries are inappropriate for your case. Fuzziness is a way to solve the problem of small misspellings that a human can make while typing a query. The human brain easily skips over the substitution of a letter in the middle of a word without losing the overall meaning of the phrase, and we expect similar behavior from a search engine.
Try regular partial matching with an ngram analyzer instead:
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trigrams_filter"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"my_field": {
"type": "text",
"analyzer": "trigrams"
}
}
}
}
}
GET my_index/my_type/_search
{
"query": {
"match": {
"my_field": {
"query": "eventsfooddrinks",
"minimum_should_match": "60%"
}
}
}
}
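To see why this works, you can inspect the trigrams produced for the stored value with the _analyze API (a sketch; the stored text and the query text break down into largely overlapping 3-character tokens, which is what minimum_should_match then scores):
GET my_index/_analyze
{
  "analyzer": "trigrams",
  "text": "theeventsfooddrinks"
}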

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field from Name using an ngram tokenizer ... so I assume it cannot be relevant to my problem, but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
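Roughly the shape I have in mind (boost values and the field used for the fuzzy clause are placeholders):
{
  "query": {
    "bool": {
      "should": [
        { "prefix": { "Name.raw": { "value": "harry", "boost": 3 } } },
        { "term": { "Name": { "value": "harry", "boost": 2 } } },
        { "match": { "Tokens": { "query": "harry", "boost": 1 } } }
      ]
    }
  }
}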
The issue I am having is with the Prefix query - it appears to not be lowercasing the search query despite my search analyzer having the lowercase filter. For example, the below query returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
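The checks were along these lines (not the exact requests):
GET myIndex/_analyze
{
  "analyzer": "pageSearchAnalyzer",
  "text": "Harry"
}

GET myIndex/_analyze
{
  "analyzer": "keywordAnalyzer",
  "text": "Harry"
}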
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
"myIndex": {
"mappings": {
"pages": {
"properties": {
"Id": {},
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer",
"search_analyzer": "pageSearchAnalyzer"
}
},
"analyzer": "pageSearchAnalyzer"
},
"Tokens": {}, // Other fields not important for this question
}
}
}
}
}
ES Index Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"keywordAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
"pageSearchAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
},
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "l2AXoENGRqafm42OSWWTAg",
"version": {}
}
}
}
}
Prefix queries don't analyze the search term, so the text you pass in bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer). Harry is evaluated as-is, directly against the keyword-tokenized, custom-filtered harry potter that the keywordAnalyzer produced at index time.
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query, using application-side lowercasing if necessary (a short sketch follows after this list)
Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
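A sketch of the first option, i.e. the query from the question with the term lowercased before it is sent:
{ "query": { "prefix": { "Name.raw": "harry" } } }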
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"pages": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "pageIndexAnalyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
"query": {
"match": {
"name.ngram": {
"query": "Har",
"operator": "and"
}
}
}
}
I think it is better to use a match_phrase_prefix query without the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
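For example, a minimal sketch of that suggestion against the Name field from the question (the query text is illustrative); since match_phrase_prefix analyzes its input, the field's lowercase filter is applied to the search terms as well:
{ "query": { "match_phrase_prefix": { "Name": "Harry Pot" } } }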

elasticsearch multiple word synonyms not working

I am new to Elasticsearch and I am trying to configure synonyms, but it is not working as expected.
I have the following data in my fields:
1) Technical Lead, Module Lead, Software Engineer, Senior Software Engineer
I want that if I search for tl, it should return "Technical Lead" (or "tl").
However, it is returning both "Technical Lead" and "Module Lead", because lead is tokenized at index time.
Could you please help me resolve this issue with the exact settings?
I have read about index-time and search-time tokenization but am unable to understand it.
synonyms.txt:
tl,TL => Technical Lead
se,SE => Software Engineer
sse => Senior Software Engineer
Mapping file:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
},
"mappings": {
"tweet": {
"properties": {
"Domain": {
"type": "string",
"analyzer": "synonym"
},
"Designation": {
"analyzer": "synonym",
"type": "string"
},
"City": {
"type": "string",
"analyzer": "synonym"
}
}
}
}
}
Your tokens are identical here, so you have that part down. What you need to do is ensure that you are doing an "AND" match instead of an "or", as it currently appears to be matching on any word rather than all of them.
Check out your tokens:
localhost:9200/test/_analyze?analyzer=synonym&text=technical lead
localhost:9200/test/_analyze?analyzer=synonym&text=tl
And the query
{
"query": {
"match": {
"domain": {
"query": "tl",
"operator": "and"
}
}
}
}
Usually you want your search and index analyzers to be the same, though there are many advanced cases where this is not preferable. With synonyms, however, you often do not want to apply them at both index and search time when you have expansions turned on,
e.g. tl,technical lead
However, since you are using the => type of synonym rule, this really doesn't matter, because the words on the left are all converted into the phrase on the right rather than creating a bunch of tokens for every word between the commas.
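If you ever switch to expansion-style rules (e.g. tl,technical lead), a common pattern is to apply the synonym filter only in a dedicated search analyzer. A rough sketch under the same string-type mappings (analyzer names are made up, and exact parameter names depend on your Elasticsearch version):
"analysis": {
  "filter": {
    "synonym": {
      "type": "synonym",
      "synonyms_path": "synonyms.txt"
    }
  },
  "analyzer": {
    "plain_index": {
      "tokenizer": "whitespace",
      "filter": ["lowercase"]
    },
    "synonym_search": {
      "tokenizer": "whitespace",
      "filter": ["lowercase", "synonym"]
    }
  }
}

"Designation": {
  "type": "string",
  "analyzer": "plain_index",
  "search_analyzer": "synonym_search"
}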
