Query in ElasticSearch to match part of a word

I am trying to write a query in ElasticSearch that matches contiguous characters within words. So, if my index has "John Doe", I should still see "John Doe" returned by Elasticsearch for the searches below.
john doe
john do
ohn do
john
n doe
I have tried the below query so far.
{
  "query": {
    "multi_match": {
      "query": "term",
      "operator": "OR",
      "type": "phrase_prefix",
      "max_expansions": 50,
      "fields": [
        "Field1",
        "Field2"
      ]
    }
  }
}
But this also returns unnecessary matches; for example, I will still get "John Doe" when I type "john x".

As explained in my comment above, prefix wildcards should be avoided at all costs as your index grows, since they force ES to do full index scans. I'm still convinced that ngrams (more precisely, edge-ngrams) are the way to go, so I'm taking a stab at it below.
The idea is to index all the suffixes of the input and then use a prefix query to match any of them, since searching for prefixes doesn't suffer the same performance issues as searching for suffixes. Concretely, the idea is to index john doe as follows:
john doe
ohn doe
hn doe
n doe
doe
oe
e
That way, using a prefix query we can match any sub-part of those tokens which effectively achieves the goal of matching partial contiguous words while at the same time ensuring good performance.
The definition of the index would go like this (note the filter chain: reversing the input turns its suffixes into prefixes, the edge-ngram filter emits those prefixes, and the second reverse restores them to their original order):
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "reverse",
              "suffixes",
              "reverse"
            ]
          }
        },
        "filter": {
          "suffixes": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
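To sanity-check the analyzer, you can inspect the tokens it produces with the _analyze API:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john doe"
}
This should return one token per suffix listed above, from e up to john doe.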
Then we can index a sample document:
PUT my_index/doc/1
{
  "name": "john doe"
}
And finally all of the following searches will return the john doe document:
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john doe"
    }
  }
}
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john do"
    }
  }
}
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "ohn do"
    }
  }
}
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john"
    }
  }
}
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "n doe"
    }
  }
}
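Conversely, a prefix query for john x should return nothing, since no indexed suffix of john doe starts with that string, which avoids the spurious matches seen with phrase_prefix:
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john x"
    }
  }
}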

This is what worked for me.
Instead of an ngram, index your data as keyword.
And use a wildcard match to query the words:
"query": {
"bool": {
"should": [
{
"wildcard": { "Field1": "*" + term + "*" }
},
{
"wildcard": { "Field2": "*" + term + "*" }
}
],
"minimum_should_match": 1
}
}
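For illustration, here is the query body with a concrete term substituted in (assuming the user typed john do). Keep in mind that a pattern with a leading * cannot use the index efficiently and has to scan every term in the field, which is exactly the performance trap described in the answer above:
{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": { "Field1": "*john do*" }
        },
        {
          "wildcard": { "Field2": "*john do*" }
        }
      ],
      "minimum_should_match": 1
    }
  }
}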

Here is an updated fix, with more options for tokenizers. Create the index with:
body = {
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
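A minimal usage sketch (the index name and document here are illustrative): with min_gram 2, any 2-to-10 character prefix of a word should match at search time.
PUT my_index/_doc/1
{
  "title": "john doe"
}
POST my_index/_search
{
  "query": {
    "match": {
      "title": "joh do"
    }
  }
}
Note that edge n-grams only cover word prefixes, so a mid-word fragment like ohn would not match with this setup.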

Related

How does phrase searching and phrase search with ~N interact with quote_field_suffix in a simple query string query?

For example, given:
PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_exact": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "english_exact"
          }
        }
      }
    }
  }
}
PUT index/_doc/1
{
  "body": "Ski resorts"
}
PUT index/_doc/2
{
  "body": "Ski house resorts"
}
What happens with the following queries?
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "quote_field_suffix": ".exact",
      "query": "\"ski resort\""
    }
  }
}
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "quote_field_suffix": ".exact",
      "query": "\"ski resort\"~2"
    }
  }
}
Will the ".exact" extend to the entire phrase, so in this case the first query would get no results?
How could you do a phrase search that is not exact when using "quote_field_suffix": ".exact"?
Will the ".exact" extend to the entire phrase, so in this case the first query would get no results?
Yes, your understanding is correct.
The documentation says: "Suffix appended to quoted text in the query string."
So it will search for an exact match of ski resort. Since there isn't one, it will return an empty result.
How could you do a phrase search that is not exact when using "quote_field_suffix": ".exact"?
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "quote_field_suffix": ".exact",
      "query": "ski resort~2"
    }
  }
}
This is not an exact match, because it also brings back ski resorts.
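The unquoted query matches because the english analyzer on the body field stems resorts to resort at index time. You can verify this with the _analyze API:
POST index/_analyze
{
  "analyzer": "english",
  "text": "Ski resorts"
}
This returns the tokens ski and resort, which is why ski resort~2 finds both documents.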

Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

Suppose there is the following mapping with Edge NGram Tokenizer:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
And the following documents are indexed:
POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}
Then searching
{
  "query": {
    "match": {
      "name": {
        "query": "HI"
      }
    }
  }
}
yields all documents with the same score, or TRENDING HI with a score higher than the others.
How can it be configured to give a higher score to entries that actually start with the searched n-gram? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI, while TRENDING HI should still be in the results.
A highlighter is also used, so the solution shouldn't mess it up.
The highlighter used in the query is:
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {}
}
}
Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.
You must understand how Elasticsearch/Lucene analyzes your data and calculates the search score.
1. Analyze API
https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html will show you what Elasticsearch stores; in your case:
T / TR / TRE / ... / TRENDING / H / HI
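You can reproduce this token list with the Analyze API against the index, for example:
POST tag/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "TRENDING HI"
}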
2. Score
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The bool query is often used to build complex queries for particular use cases. Use must to filter documents, then should to score them. A common use case is to apply different analyzers to the same field (using multi-fields in the mapping, you can analyze the same field in different ways).
3. Don't mess up the highlighting
According to the doc https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query
you can add an extra highlight query:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}
In this particular case you could add a match_phrase_prefix term to your query, which does a prefix match on the last term in the text:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ]
    }
  }
}
The match term will match on all three results, but the match_phrase_prefix won't match on TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.
Quoting the docs:
The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.
On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
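For instance (a sketch, not part of the original answer), making the default explicit: at least one of the two clauses must match, so all three documents stay in the results while the prefix clause boosts the HITS entries:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}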
A possible solution for this problem is to use multifields. They allow for indexing of the same data from your source document in different ways. In your case you could index the name field as default text, then as ngrams and also as edgengrams. Then the query would have to be a bool query comparing with all those different fields.
The final score of documents is composed of the match value for each one. Those matches are also called signals, signalling that there is a match between the query and the document. The document with most signals matching gets the highest score.
In your case all documents would match the ngram HI. But only the HITS FIND SOME and the HITS OTHER document would get the edgengram additional score. This would give those two documents a boost and put them on top. The complication with this is that you have to make sure that the edgengram doesn't split on whitespaces, because then the HI at the end would get the same score as in the beginning of the document.
Here is an example mapping and query for your case:
PUT /tag/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_analyzer": {
          "tokenizer": "edge_tokenizer"
        },
        "kw_analyzer": {
          "tokenizer": "kw_tokenizer"
        },
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        },
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "kw_tokenizer": {
          "type": "keyword"
        },
        "edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "fields": {
            "edge": {
              "type": "text",
              "analyzer": "edge_analyzer"
            },
            "ngram": {
              "type": "text",
              "analyzer": "ngram_analyzer"
            }
          }
        }
      }
    }
  }
}
And a query:
POST /tag/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "match": {
                "name.edge": {
                  "query": "HI"
                }
              }
            },
            "boost": "5",
            "boost_mode": "multiply"
          }
        },
        {
          "match": {
            "name.ngram": {
              "query": "HI"
            }
          }
        },
        {
          "match": {
            "name": {
              "query": "HI"
            }
          }
        }
      ]
    }
  }
}

Wildcard / regexp in a phrase which has space

Create an index. Here I am using edge_ngram:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "keyword",
          "fields": {
            "raw": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
POST my_index/my_type/1
{
  "text": "2 #Quick Foxes lived and died"
}
POST my_index/my_type/2
{
  "text": "2 #Quick Foxes lived died"
}
Now when we search
GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "f* d*",
      "fields": ["text.raw"]
    }
  }
}
Only document 2 should be returned, but nothing comes back.
When you try this:
GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "f* d*",
      "fields": ["text"]
    }
  }
}
It will return both.
If we have an index with a huge amount of data and we want to search with wildcards, how do we do it?
A single keyword works, but if we add phrases like the one mentioned in the example, it won't give any proper result.
To build a regular expression, you can use these websites:
Generate a regex expression here: http://buildregex.com/
And test your string against the generated expression here: https://regex101.com/

Elastic search Unordered Partial phrase matching with ngram

Maybe I am going down the wrong route, but I am trying to set up Elasticsearch to use partial phrase matching to return parts of words from any order within a sentence.
E.g. I have the following input:
test name
tester name
name test
namey mcname face
test
And I hope to do a search for "test name" (or "name test") and have all of these returned, hopefully sorted in order of score. I can do partial searches, and I can also do out-of-order searches, but I am not able to combine the two. I am sure this must be a very common issue.
Below are my settings:
{
  "myIndex": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "mynGram": {
              "type": "nGram",
              "min_gram": "2",
              "max_gram": "5"
            }
          },
          "analyzer": {
            "custom_analyser": {
              "filter": [
                "lowercase",
                "mynGram"
              ],
              "type": "custom",
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "nGram",
              "min_gram": "2",
              "max_gram": "5"
            }
          }
        }
      }
    }
  }
}
My mapping
{
  "myIndex": {
    "mappings": {
      "myIndex": {
        "properties": {
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            },
            "analyzer": "custom_analyser"
          }
        }
      }
    }
  }
}
And my query
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "name": {
              "query": "test name",
              "slop": 5
            }
          }
        }
      ]
    }
  }
}
Any help would be greatly appreciated.
Thanks in advance
Not sure if you found your solution - I bet you did, because this is such an old post - but I was on the hunt for the same thing and found this: Query-Time Search-as-you-type.
Look up slop.
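As a sketch of the query-time approach that guide describes (illustrative, building on the nGram mapping above): a plain match query ignores token order, so with ngram analysis it can match partial words in any order; minimum_should_match controls how many of the query's n-grams must be present and is something to tune:
{
  "query": {
    "match": {
      "name": {
        "query": "test name",
        "minimum_should_match": "75%"
      }
    }
  }
}
Documents sharing more n-grams with the query (test name, name test) should then score above weaker matches (namey mcname face).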

ElasticSearch: How to use edge_ngram and have real relevant hits to display first

I'm new to Elasticsearch and I'm trying to develop a search for an ecommerce site that suggests 5-10 matching products to the user.
Since it should work while the user is typing, we found edge_ngram in the official documentation, and it KIND OF worked. But as we tested searches, the results were not what we expected, as our test below shows.
[Screenshot: search results for "Furadeira"]
As shown in the image, a search for the term "Furadeira" (power drill) returns accessories before the power drill itself. How can I improve the results? Even the position where the match occurs in the string would help me, I guess.
So, this is the code I have so far:
//PUT example
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        },
        "portuguese_stop": {
          "type": "stop",
          "stopwords": "_portuguese_"
        },
        "portuguese_stemmer": {
          "type": "stemmer",
          "language": "light_portuguese"
        }
      },
      "analyzer": {
        "portuguese": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "portuguese_stop",
            "portuguese_stemmer"
          ]
        },
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
/* mapping */
//PUT /example/products/_mapping
{
  "products": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
/* Search */
//GET /example/products/_search
{
  "query": {
    "query_string": {
      "query": "furadeira",
      "type": "most_fields", // Tried without this as well
      "fields": [
        "name^8",
        "model^10",
        "manufacturer^4",
        "description"
      ]
    }
  }
}
/* Product example */
//PUT example/products/38313
{
  "name": "FITA VEDA FRESTA (ESPUMA 4503) 12X5 M [ H0000164055 ]",
  "description": "Caracteristicas do produto:Ve…Diminui ruidos indesejaveis.",
  "price": 21.90,
  "product_id": 38313,
  "image": "http://placehold.it/200x200",
  "quantity": 92,
  "width": 20.200,
  "height": 1.500,
  "length": 21.500,
  "weight": 0.082,
  "model": "167083",
  "manufacturer": "3M DO BRASIL"
}
Thanks in advance.
You could enhance your query to a so-called bool query, which contains your existing query in a must clause plus an additional query in a should clause that matches exactly (not using the ngrammed field). If the should clause matches, the document is scored higher.
See the bool query documentation.
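A minimal sketch of that idea (assuming a hypothetical name.exact subfield mapped with the standard analyzer instead of the autocomplete one; the subfield name is illustrative):
{
  "query": {
    "bool": {
      "must": {
        "match": { "name": "furadeira" }
      },
      "should": {
        "match": { "name.exact": "furadeira" }
      }
    }
  }
}
Documents whose exact field matches the whole word then outrank pure n-gram matches.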
Let's assume you have a field that differentiates the main product from accessories. I'll call it level_field.
Now you have two approaches to choose from:
1) Boost the main product's _score by adding a should clause:
Put your main query in the must clause, and in the should clause use level_field to boost the _score of documents that are main products.
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name": {
            "query": "furadeira"
          }
        }
      },
      "should": [
        {
          "match": {
            "level_field": {
              "query": "level1",
              "boost": 3
            }
          }
        },
        {
          "match": {
            "level_field": {
              "query": "level2",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}
2) In the second approach, you can decrease the _score of documents that are not main products by using a boosting query:
{
  "query": {
    "boosting": {
      "positive": {
        "query_string": {
          "query": "furadeira",
          "type": "most_fields",
          "fields": [
            "name^8",
            "model^10",
            "manufacturer^4",
            "description"
          ]
        }
      },
      "negative": {
        "term": {
          "level_field": {
            "value": "level2"
          }
        }
      },
      "negative_boost": 0.2
    }
  }
}
I hope it helps.
