How to handle an unordered multi-word query in Elasticsearch?

I have the following situation:
The simple analyzer processes the text "The brown and green fox are quick" and adds the individual lower case terms to the index.
I want to use the following query phrase against my indices: "quick brown f"
I use the match_phrase_prefix in order to run this search:
{
  "query": {
    "match_phrase_prefix": {
      "message": {
        "query": "quick brown f",
        "max_expansions": 10
      }
    }
  }
}
Unfortunately, no results are returned, since the order of the terms in the document does not match the order in the query. I do get results back if I use a match query with the complete terms. It seems that match_phrase_prefix checks the order, as the documentation says:
This query works by creating a phrase query out of quick and brown (i.e. the term quick must exist and must be followed by the term brown).
My question:
Is there a way to run a query that handles incomplete terms and returns results regardless of the order of the terms in the source document? The only option I can currently think of is to manually create a query for each term in the input (e.g. quick, brown, f) and combine them with a bool query, as sketched below.
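For reference, a minimal sketch of that bool fallback might look like the following (my_index stands in for your index, the message field comes from the query above, the complete terms use plain match clauses, and the last token f is treated as a prefix):
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "quick" } },
        { "match": { "message": "brown" } },
        { "prefix": { "message": "f" } }
      ]
    }
  }
}
This matches the terms in any order, but you have to tokenize the input and build the clauses yourself.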

An edge_ngram-based analyzer should do what you want. If you set it up with min_gram set to 1 and max_gram set to 10, the document will have the necessary tokens stored. Then you can apply the standard analyzer to your query text and match it against the edge_ngram document field.
The example in the documentation is almost exactly your requested solution. Note the use of the explicit and operator in the query to make sure all of your search tokens, partial or otherwise, are matched.
From the documentation for 5.6:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
PUT my_index/doc/1
{
  "title": "Quick Foxes"
}
POST my_index/_refresh
GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo",
        "operator": "and"
      }
    }
  }
}
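If you want to see which tokens actually end up in the index, a quick check (a sketch against the index defined above) is to run the custom analyzer through the _analyze API:
POST my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Quick Foxes"
}
This should return the edge n-grams q, qu, qui, quic, quick, f, fo, fox, foxe, foxes, which is why the partial input "Quick Fo" matches once the and operator requires every query token to be present.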

Related

elasticsearch n-gram tokenizer with match_phrase not giving expected result

I created an index as follows:
PUT /ngram_tokenizer
{
  "mappings": {
    "properties": {
      "test_name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "max_ngram_diff": 20
    },
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "whitespace",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}
Then I indexed two documents as follows:
POST /ngram_tokenizer/_doc
{
  "test_name": "test document"
}
POST /ngram_tokenizer/_doc
{
  "test_name": "another document"
}
Then I ran a match_phrase query:
GET /ngram_tokenizer/_search
{
  "query": {
    "match_phrase": {
      "test_name": "document"
    }
  }
}
The above query returns both documents as expected, but the query below doesn't return any documents:
GET /ngram_tokenizer/_search
{
  "query": {
    "match_phrase": {
      "test_name": "test"
    }
  }
}
I also checked all the tokens the analyzer generates with the following query:
POST ngram_tokenizer/_analyze
{
  "analyzer": "my_analyzer",
  "text": "test document"
}
A plain match query works fine. Can you help me figure out what is going on?
Update
When I want to search for a phrase I have to use a match_phrase query, right? I used the n-gram tokenizer on that field so that, even if there is a typo in the search term, I can still get a similar document. I also know that we can use fuzziness to overcome typo issues in search terms, but when I used fuzziness in match queries or fuzzy queries there was a scoring issue, as mentioned here. What I actually want is this: when I do a match query, I want to get results even though there is a typo in the search terms, and with the match_phrase query I should get proper results at least when I search without any typos.
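For reference, the kind of fuzzy match mentioned above might look like this (a minimal sketch; documnt is just an illustrative typo of document):
GET /ngram_tokenizer/_search
{
  "query": {
    "match": {
      "test_name": {
        "query": "documnt",
        "fuzziness": "AUTO"
      }
    }
  }
}
As you note, though, this brings its own scoring quirks.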
It's because, at search time, the analyzer used to analyze the input text is the same one used at indexing time, i.e. my_analyzer, and the match_phrase query is a bit more complex than the match query.
At search time you should simply use the standard analyzer (or something other than the ngram analyzer) to analyze your query input.
The following query shows how to make it work as you expect.
GET /ngram_tokenizer/_search
{
  "query": {
    "match_phrase": {
      "test_name": {
        "query": "test",
        "analyzer": "standard"
      }
    }
  }
}
You can also specify the standard analyzer as a search_analyzer in your mapping
"test_name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
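If the index already exists, search_analyzer is one of the mapping parameters that can usually be updated in place, provided the analyzer is repeated in the request. A sketch (verify against your version's documentation; no reindexing should be needed):
PUT /ngram_tokenizer/_mapping
{
  "properties": {
    "test_name": {
      "type": "text",
      "analyzer": "my_analyzer",
      "search_analyzer": "standard"
    }
  }
}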

Elasticsearch with fuzziness of more than 2 characters (distance)

I am trying to match text fields and expect results when there is 60%-plus similarity.
With fuzziness we can only allow an edit distance of 2. My Elasticsearch index has a record with the description 'theeventsfooddrinks', and I am trying to match 'theeventsfooddrinks123'. This doesn't match.
'theeventsfooddrinks12' => matches
'theeventsfooddri' => doesn't match
'321eventsfooddrinks' => doesn't match
I also want Elasticsearch to match it for 'eventsfooddrinks'. Any change requiring more than 2 edits does not match.
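For illustration, the failing case might look like the following (a minimal sketch; the index name my_index and the field name description are assumed). Because fuzziness is capped at an edit distance of 2, the extra 123 suffix is 3 edits away and produces no hit:
GET my_index/_search
{
  "query": {
    "match": {
      "description": {
        "query": "theeventsfooddrinks123",
        "fuzziness": 2
      }
    }
  }
}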
I think fuzzy queries are inappropriate for your case. Fuzziness is a way to handle the small misspellings a human can make while typing a query. The human brain easily skips the substitution of a letter in the middle of a word without losing the overall meaning of the phrase, and we expect similar behavior from a search engine.
Try regular partial matching with an ngram analyzer instead:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigrams": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "trigrams_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "trigrams"
        }
      }
    }
  }
}
GET my_index/my_type/_search
{
  "query": {
    "match": {
      "my_field": {
        "query": "eventsfooddrinks",
        "minimum_should_match": "60%"
      }
    }
  }
}
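To see why this works, you can inspect the trigrams produced for the indexed value (a sketch against the index defined above):
POST my_index/_analyze
{
  "analyzer": "trigrams",
  "text": "theeventsfooddrinks"
}
The query string "eventsfooddrinks" is broken into the same kind of trigrams, and minimum_should_match: "60%" only requires most of them to overlap, so extra or missing characters at either end no longer automatically prevent a match.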

Elasticsearch query returning false results when term exceeds ngram length

The requirement is to search partial phrases in a block of text. Most of the words will be standard length. I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and these show up if I type in a query where the first 10 characters match, but then the rest don't.
For example, here is the mapping:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
and document:
POST my_index/doc/1
{
  "title": "Quick fox with id of ABCDEFGHIJKLMNOP"
}
If I run the query:
POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "fox wi"
      }
    }
  }
}
It returns the document as expected. However, if I run this:
POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "ABCDEFGHIJxxx"
      }
    }
  }
}
It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?
I am using version 5.
By default, the analyzer that is used at index time is the same analyzer that is used at search time, meaning the edge_ngram analyzer is used on your search term. This is not what you want. You will end up with 10 tokens as the search terms, none of which contain those last 3 characters.
You will want to take a look at the Search Analyzer for your mapping. This documentation points out this specific use case:
Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.
The standard analyzer may suit your needs:
{
  ...
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
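If you want to confirm the effect before reindexing, you can also override the analyzer directly in the query (a sketch against the index above); with the standard analyzer the search term stays a single token abcdefghijxxx, which no longer matches the stored 10-character edge n-grams:
POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "ABCDEFGHIJxxx",
        "analyzer": "standard"
      }
    }
  }
}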

ElasticSearch does not respect Max NGram length while using NGram Tokenizer

I am using the ngram tokenizer and have specified min_gram as 3 and max_gram as 5. However, even if I search for a word longer than 5 characters, it still returns the record. This is strange, since ES will not index a gram of length 6, yet I am still able to retrieve the record. Is there any theory I am missing here? If not, what significance does the max_gram of the ngram tokenizer really have? Following is the mapping that I tried:
PUT ngramtest
{
  "mappings": {
    "MyEntity": {
      "properties": {
        "testField": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
Indexed a test entity as:
PUT ngramtest/MyEntity/123
{
  "testField": "Z/16/000681"
}
And this query weirdly yields results:
GET ngramtest/MyEntity/_search
{
  "query": {
    "match": {
      "testField": "000681"
    }
  }
}
I have tried this for 'analyzing' the string:
POST ngramtest/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Z/16/000681."
}
Can someone please point out where I am going wrong?
The reason for this is that your analyzer my_analyzer is used for indexing AND searching. Hence, when you search for a six-character word such as abcdef, that word is also analyzed by your ngram analyzer at search time, producing the tokens abc, abcd, abcde, bcd, etc., and those match the indexed tokens.
What you need to do is specify that you want to use the standard analyzer as the search_analyzer in your mapping:
"testField":{
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
Before wiping your index and repopulating it, you can test this theory simply by specifying the search analyzer to use in your match query:
GET ngramtest/MyEntity/_search
{
  "query": {
    "match": {
      "testField": {
        "query": "000681",
        "analyzer": "standard"
      }
    }
  }
}
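You can also compare the two analyzers directly with _analyze; the standard analyzer keeps the search term as a single token, whereas my_analyzer splits it into 3- to 5-character grams that happen to match the indexed ones:
POST ngramtest/_analyze
{
  "analyzer": "standard",
  "text": "000681"
}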

Why does the match_phrase_prefix query return wrong results with different lengths of phrase?

I have a very simple query:
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": "grab"
    }
  }
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and this index definition (I use the Stempel (Polish) Analysis plugin for Elasticsearch):
POST /indexX
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          },
          "polish_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          },
          "polish_my_stem": {
            "type": "stemmer",
            "rules_path": "analysis/stems.txt"
          }
        },
        "analyzer": {
          "polish_with_synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "lowercase",
              "polish_stop",
              "polish_stem",
              "polish_my_stem"
            ]
          }
        }
      }
    }
  }
}
For this query I get zero results. When I change the phrase to GRA or GRABA, it returns one result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values even as high as 1200 and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
Without going into details on how to resolve this, please consider getting rid of the polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I have seen different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess that in the case of Polish names some kind of prefix query on a non-analyzed field would suffice; a sketch follows below.
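As a rough sketch of that idea (assuming a 5.x-or-newer cluster, where the field type is text rather than string and the non-analyzed sub-field, here called surname.raw, is a keyword; on older versions the equivalent is a not_analyzed string), the mapping could gain a sub-field:
"surname": {
  "type": "text",
  "analyzer": "polish",
  "copy_to": [
    "full_name"
  ],
  "fields": {
    "raw": {
      "type": "keyword"
    }
  }
}
and the prefix search would then target it directly:
GET /indexX/_search
{
  "query": {
    "prefix": {
      "surname.raw": "GRAB"
    }
  }
}
Keep in mind that keyword values are not lowercased, so the prefix has to match the stored casing unless you add a lowercase normalizer.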
