Could I combine wildcard and fulltext search in Elasticsearch? - elasticsearch

For example, I have some title data in Elasticsearch like this:
gamexxx_nightmare,
gamexxx_little_guy
Then I enter:
game => should return gamexxx_nightmare and gamexxx_little_guy
little guy => should return gamexxx_little_guy
My first thought is to use a wildcard so that game matches gamexxx, while the second case looks like a full-text search.
How can I combine them in one DSL query?

Jaspreet's answer is right, but it doesn't combine both requirements in one query DSL, as the OP asked in his question ("How to combine them in one DSL??").
This is an enhancement to Jaspreet's solution: it also avoids the wildcard, and it even avoids the n-gram analyzer, which is costly (it increases the index size) and requires re-indexing if the requirements change.
A single search query that combines both requirements can be built as below:
Index mapping (note the replace_underscore char filter)
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"replace_underscore"
]
}
},
"char_filter": {
"replace_underscore": {
"type": "mapping",
"mappings": [
"_ => \\u0020"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer" : "my_analyzer"
}
}
}
}
Index your sample docs
{
"title" : "gamexxx_little_guy"
}
And
{
"title" : "gamexxx_nightmare"
}
Single search query (note the top-level must clause)
{
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"prefix": {
"title": {
"value": "game"
}
}
}
]
}
},
{
"bool": {
"must": [
{
"match": {
"title": {
"query": "little guy"
}
}
}
]
}
}
]
}
}
}
Result
{
"_index": "so-46873023",
"_type": "_doc",
"_id": "2",
"_score": 2.2814486,
"_source": {
"title": "gamexxx_little_guy"
}
}
Important points:
The first part of the query is a prefix query, which matches game in both documents. (This avoids a costly regex.)
The second part enables the full-text search. To make it work, I used a custom analyzer that replaces _ with whitespace, so you don't need expensive n-grams in the index and a simple match query fetches the results.
The query above returns only results matching both criteria. If you want results matching either criterion, change the top-level bool clause from must to should.
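To see why the two clauses behave as described, here is a rough Python sketch of the same logic. It is only a simulation, not Elasticsearch itself: the helper names (analyze, prefix_matches, match_matches) are made up for illustration, and the regex split is a simplified stand-in for the standard tokenizer.

```python
import re

def analyze(text):
    """Mimic my_analyzer: map '_' to a space, then split into word tokens."""
    return re.findall(r"\w+", text.replace("_", " "))

def prefix_matches(tokens, prefix):
    """Prefix query: true if any indexed token starts with the given prefix."""
    return any(token.startswith(prefix) for token in tokens)

def match_matches(tokens, query):
    """Match query (default OR operator): true if any analyzed query term is indexed."""
    return any(term in tokens for term in analyze(query))

docs = ["gamexxx_little_guy", "gamexxx_nightmare"]
index = {doc: analyze(doc) for doc in docs}

# Top-level "must": both clauses have to match.
must_hits = [d for d, toks in index.items()
             if prefix_matches(toks, "game") and match_matches(toks, "little guy")]
# Top-level "should": either clause is enough.
should_hits = [d for d, toks in index.items()
               if prefix_matches(toks, "game") or match_matches(toks, "little guy")]

print(must_hits)    # ['gamexxx_little_guy']
print(should_hits)  # ['gamexxx_little_guy', 'gamexxx_nightmare']
```

With must, only gamexxx_little_guy survives, which is the single hit shown in the result above. With should, both documents come back.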

N-grams generally perform better at search time than wildcards. For a wildcard query, the indexed terms have to be scanned to see which ones match the pattern. N-grams break text into small tokens at index time instead. For example, Quick Foxes will be stored as [ Qui, uic, ick, Fox, oxe, xes ], depending on the min_gram and max_gram size (both 3 in this example).
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"text":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
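The tokenization this mapping produces can be approximated in a few lines of Python. This is a sketch under assumptions: ngram_tokens is a hypothetical helper, and the regex only approximates the letter and digit token_chars setting.

```python
import re

def ngram_tokens(text, min_gram=3, max_gram=3):
    """Split on non-letter/digit characters, then emit every n-gram of each word."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens

print(ngram_tokens("Quick Foxes"))
# ['Qui', 'uic', 'ick', 'Fox', 'oxe', 'xes']

# The query text goes through the same analyzer, so its grams are looked up as-is.
doc_grams = set(ngram_tokens("gamexxx_little_guy"))
query_grams = ngram_tokens("little guy")
print(all(g in doc_grams for g in query_grams))  # True
```

Because the underscore is neither a letter nor a digit, gamexxx_little_guy is split into gamexxx, little and guy before the grams are cut, so every gram of the query is present in the index.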
Query
GET my_index/_search
{
"query": {
"match": {
"text": "little guy"
}
}
}
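If you want to use only a wildcard, you can search on a not_analyzed (keyword) string. This also handles spaces between words. The query below assumes a text.keyword sub-field exists, which the mapping above does not define:
{
"query": {
"wildcard": {
"text.keyword": {
"value": "*gamexxx*"
}
}
}
}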

Related

How to find word 'food2u' by search 'food' in Elasticsearch?

I am a rookie who just started learning Elasticsearch, and I want to find a word like 'food2u' by searching for the keyword 'food'. But I only get results like 'Food Repo', 'Give Food', etc. The field's mapping is 'text' and this is my query:
GET api/_search
{"query": {
"match": {
"Name": {
"query": "food"
}
}
},
"_source":{
"includes":["Name"]
}
}
You are getting results like 'Food Repo' and 'Give Food' because a text field uses the standard analyzer if no analyzer is specified. Food Repo gets tokenized into food and repo. Similarly, Give Food gets tokenized into give and food.
But food2u gets tokenized into food2u. Since there is no matching token ("food"), you will not get the food2u document.
You need to use the edge_ngram tokenizer to do a partial text match.
Here is a working example with index data, mapping, search query and search result.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 4,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"name":"food2u"
}
Search Query:
{
"query": {
"match": {
"name": "food"
}
}
}
Search Result:
"hits": [
{
"_index": "67552800",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "food2u"
}
}
]
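Why the match succeeds can be checked by mimicking the edge_ngram tokenizer in Python. This is a simplified sketch: edge_ngram_tokens is a hypothetical helper, and the regex approximates the letter and digit token_chars from the mapping above.

```python
import re

def edge_ngram_tokens(text, min_gram=4, max_gram=10):
    """Emit prefixes of each letter/digit run, from min_gram up to max_gram characters."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])
    return tokens

indexed = edge_ngram_tokens("food2u")
print(indexed)  # ['food', 'food2', 'food2u']

query_terms = edge_ngram_tokens("food")
print(query_terms)  # ['food']
print(any(term in indexed for term in query_terms))  # True
```

Note that with min_gram set to 4, a shorter input such as foo produces no tokens at all in this sketch, so it would not match.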
If you don't want to change the mapping, you can also use a wildcard query to return the matching documents:
{
"query": {
"wildcard": {
"Name": {
"value": "food*"
}
}
}
}
Or you can use query_string with a wildcard:
{
"query": {
"query_string": {
"query": "food*",
"fields": [
"Name"
]
}
}
}

Search in Elasticsearch for a string containing the "not" keyword

I am using Elasticsearch on AWS (version 7.9) and I am trying to distinguish between two strings.
My main goal is to split the search results into "Found" and "Not found".
The general question is how to search for the "not" keyword.
Two example messages are shown below.
"CachingServiceOne:Found in cache - Retrieve."
"CachingServiceThree:Not found in cache - Create new."
You can use the ngram tokenizer to search for "not" in the "title" field.
Here is a working example with index data, mapping, search query, and search result.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"title":"CachingServiceThree:Not found in cache - Create new."
}
{
"title":"CachingServiceOne:Found in cache - Retrieve."
}
Search Query:
{
"query":{
"match":{
"title":"Not"
}
}
}
Search Result:
"hits": [
{
"_index": "67093372",
"_type": "_doc",
"_id": "2",
"_score": 0.6720003,
"_source": {
"title": "CachingServiceThree:Not found in cache - Create new."
}
}
]
It turns out the problem is indeed the way the default analyzer works, not that I could not search for the word not. That is why I accepted the answer. But I would like to add another take, for the sake of simplicity.
The default analyzer does not split words on :.
That means we have to search for title:CachingServiceThree\:Not,
where title is the field name and : must be escaped as \:.
What did the trick was title:*\:Not and title:*\:Found using the KQL syntax.
The wildcard fetches everything regardless of the service name. I am wondering whether using an array of all the actual values would be quicker.
The Inspect panel shows this translated into:
{
"query": {
"bool": {
"filter": [
{
"bool": {
"should": [
{
"query_string": {
"fields": [
"title"
],
"query": "*\\:Not"
}
}
],
"minimum_should_match": 1
}
}
]
}
}
}
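If you build such a query from code instead of Kibana, the escaping can be done programmatically. The following is a minimal sketch, not an official client API: escape_query_string and suffix_query are hypothetical helpers, and the character set is my reading of the reserved characters listed in the query_string documentation.

```python
import json
import re

# Characters that query_string treats as operators; each one gets a backslash.
# Assumption: "<" and ">" are left out because they cannot be escaped at all.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')

def escape_query_string(value):
    """Backslash-escape reserved query_string characters in a literal value."""
    return _RESERVED.sub(r"\\\1", value)

def suffix_query(field, literal_suffix):
    """Build a query_string body matching terms that end with literal_suffix."""
    return {
        "query": {
            "query_string": {
                "fields": [field],
                # The leading * is a real wildcard, so it is added after escaping.
                "query": "*" + escape_query_string(literal_suffix),
            }
        }
    }

print(escape_query_string("CachingServiceThree:Not"))  # CachingServiceThree\:Not
print(json.dumps(suffix_query("title", ":Not")))
```

When serialized, the query value appears as "*\\:Not", the same string the Inspect panel shows above.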

How can I get auto-suggestions for synonyms match in elasticsearch

I'm using the code below, and it does not suggest curd when I type "cu".
But it does match the document with yogurt, which is correct.
How can I get both auto-completion for synonyms and a document match for the same query?
PUT products
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"synonym_graph"
]
}
},
"filter": {
"synonym_graph": {
"type": "synonym_graph",
"synonyms": [
"yogurt, curd, dahi"
]
}
}
}
}
}
}
PUT products/_mapping
{
"properties": {
"description": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
POST products/_doc
{
"description": "yogurt"
}
GET products/_search
{
"query": {
"match": {
"description": "cu"
}
}
}
When you provide a list of synonyms in a synonym_graph filter it simply means that ES will treat any of the synonyms interchangeably. But when they're analyzed via the standard analyzer, only full-word tokens will be produced:
POST products/_analyze?filter_path=tokens.token
{
"text": "yogurt",
"field": "description"
}
yielding:
{
"tokens" : [
{
"token" : "curd"
},
{
"token" : "dahi"
},
{
"token" : "yogurt"
}
]
}
As such, a regular match query won't cut it here because the standard analyzer hasn't provided it with enough context in terms of matchable substrings (n-grams).
In the meantime you can replace match with match_phrase_prefix which does exactly what you're after -- match an ordered sequence of characters while taking into account the synonyms:
GET products/_search
{
"query": {
"match_phrase_prefix": {
"description": "cu"
}
}
}
But that, as the query name suggests, is only going to work for prefixes. If you fancy an autocomplete that suggests terms regardless of where the substring matches occur, have a look at my other answer where I talk about leveraging n-grams.
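The difference between match and match_phrase_prefix described above can be mimicked with a small Python sketch. It is a toy model only: the analyze, match and match_phrase_prefix functions are illustrative stand-ins, whitespace splitting replaces the standard tokenizer, and only the single-term case is covered.

```python
SYNONYMS = [{"yogurt", "curd", "dahi"}]

def analyze(text):
    """Lowercase, split on whitespace, and expand every word to its synonym set."""
    tokens = set()
    for word in text.lower().split():
        group = next((g for g in SYNONYMS if word in g), {word})
        tokens |= group
    return tokens

indexed = analyze("yogurt")  # the three synonyms are all indexed

def match(query):
    """match: an analyzed query term must equal an indexed term exactly."""
    return bool(analyze(query) & indexed)

def match_phrase_prefix(query):
    """match_phrase_prefix, single-term case: the term is treated as a prefix."""
    term = query.lower()
    return any(token.startswith(term) for token in indexed)

print(match("cu"))                # False
print(match("curd"))              # True
print(match_phrase_prefix("cu"))  # True
```

The term cu is not a synonym of anything, so an exact lookup fails, while the prefix lookup finds the indexed synonym curd.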

How to query for phrases (shingles) in Elasticsearch

I have the following string "Word1 Word2 StopWord1 StopWord2 Word3 Word4".
When I query for this string using ["bool"]["must"]["match"], I would like to return all text that matches "Word1Word2" and/or "Word3Word4".
I have created an analyzer that I would like to use for indexing and searching.
Using analyze API, I have confirmed that indexing is being done correctly. The shingles returned are "Word1Word2" and "Word3Word4"
I want to query so that text matching "Word1Word2" and/or "Word3Word4" are returned. How can I do this dynamically - meaning, I don't know up front how many shingles will be generated, so I don't know how many match_phrase to code up in a query.
"should":[
{ "match_phrase" : {"content": phrases[0]}},
{ "match_phrase" : {"content": phrases[1]}}
]
To query for shingles (and unigrams), you could set up your mappings to handle them cleanly in separate fields. In the example below, the sub-field "title.shingles" is used to analyze and retrieve shingles, while the main "title" field handles unigrams.
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
PUT /my_index/_mapping/my_type
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": "<your query string>"
}
},
"should": {
"match": {
"title.shingles": "<your query string>"
}
}
}
}
}
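Because the match query on title.shingles runs the query string through the same shingle analyzer, you do not have to enumerate the phrases yourself. A rough Python sketch of that behaviour follows. It assumes a hypothetical shingles helper, whitespace splitting instead of the standard tokenizer, and a single space as the separator between words.

```python
def shingles(text, size=2, separator=" "):
    """Lowercase, split on whitespace, and join each run of `size` adjacent words."""
    words = text.lower().split()
    return [separator.join(words[i:i + size]) for i in range(len(words) - size + 1)]

print(shingles("Word1 Word2 Word3"))
# ['word1 word2', 'word2 word3']

# Document and query are shingled the same way; any shared shingle is a match.
doc_shingles = set(shingles("Word1 Word2 Word3 Word4"))
query_shingles = shingles("Word3 Word4 Word5")
common = [s for s in query_shingles if s in doc_shingles]
print(common)  # ['word3 word4']
```

However many shingles the query produces, the comparison stays the same, which is why a single match clause is enough.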
Ref. Elasticsearch: The Definitive Guide....

Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

Suppose there is the following mapping with Edge NGram Tokenizer:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"tokenizer": "autocomplete_tokenizer",
"filter": [
"standard"
]
},
"autocomplete_search": {
"tokenizer": "whitespace"
}
},
"tokenizer": {
"autocomplete_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter",
"symbol"
]
}
}
}
},
"mappings": {
"tag": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
And the following documents are indexed:
POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}
Then searching
{
"query": {
"match": {
"name": {
"query": "HI"
}
}
}
}
yields all with the same score, or TRENDING HI with a score higher than one of the others.
How can it be configured so that entries that actually start with the searched n-gram get a higher score? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI; at the same time, TRENDING HI should still be in the results.
A highlighter is also used, so the given solution shouldn't mess it up.
The highlighter used in the query is:
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {}
}
}
Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.
You must understand how Elasticsearch/Lucene analyzes your data and calculates the search score.
1. Analyze API
https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html shows you what Elasticsearch will store. In your case:
T / TR / TRE /.... TRENDING / / H / HI
2. Score
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The bool query is often used to build a complex query for a particular use case. Use must to filter documents, then should to score them. A common use case is to apply different analyzers to the same field (by using the fields keyword in the mapping, you can analyze the same field in different ways).
3. Don't mess up the highlight
According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query
You can add an extra query:
{
"query": {
"bool": {
"must" : [
{
"match": {
"name": "HI"
}
}
],
"should": [
{
"prefix": {
"name": "HI"
}
}
]
}
},
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {
"highlight_query": {
"match": {
"name": "HI"
}
}
}
}
}
}
In this particular case you could add a match_phrase_prefix clause to your query, which does a prefix match on the last term of the query text:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": "HI"
}
},
{
"match_phrase_prefix": {
"name": "HI"
}
}
]
}
}
}
The match term will match on all three results, but the match_phrase_prefix won't match on TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.
Quoting the docs:
The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.
On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
A possible solution for this problem is to use multifields. They allow for indexing of the same data from your source document in different ways. In your case you could index the name field as default text, then as ngrams and also as edgengrams. Then the query would have to be a bool query comparing with all those different fields.
The final score of documents is composed of the match value for each one. Those matches are also called signals, signalling that there is a match between the query and the document. The document with most signals matching gets the highest score.
In your case all documents would match the ngram HI. But only the HITS FIND SOME and the HITS OTHER document would get the edgengram additional score. This would give those two documents a boost and put them on top. The complication with this is that you have to make sure that the edgengram doesn't split on whitespaces, because then the HI at the end would get the same score as in the beginning of the document.
Here is an example mapping and query for your case:
PUT /tag/
{
"settings": {
"analysis": {
"analyzer": {
"edge_analyzer": {
"tokenizer": "edge_tokenizer"
},
"kw_analyzer": {
"tokenizer": "kw_tokenizer"
},
"ngram_analyzer": {
"tokenizer": "ngram_tokenizer"
},
"autocomplete_analyzer": {
"tokenizer": "autocomplete_tokenizer",
"filter": [
"standard"
]
},
"autocomplete_search": {
"tokenizer": "whitespace"
}
},
"tokenizer": {
"kw_tokenizer": {
"type": "keyword"
},
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
},
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
},
"autocomplete_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter",
"symbol"
]
}
}
}
},
"mappings": {
"tag": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"fields": {
"edge": {
"type": "text",
"analyzer": "edge_analyzer"
},
"ngram": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
}
And a query:
POST /tag/_search
{
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {
"match": {
"name.edge": {
"query": "HI"
}
}
},
"boost": "5",
"boost_mode": "multiply"
}
},
{
"match": {
"name.ngram": {
"query": "HI"
}
}
},
{
"match": {
"name": {
"query": "HI"
}
}
}
]
}
}
}
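To see why the name.edge clause only rewards documents that start with HI, the two sub-field analyzers can be approximated in Python. This is a sketch under assumptions: both helper functions are hypothetical, and whitespace splitting stands in for the letter and digit token_chars of the ngram tokenizer.

```python
def edge_grams_whole_text(text, min_gram=2, max_gram=10):
    """edge_tokenizer without token_chars: prefixes of the whole text, spaces included."""
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

def ngrams_per_word(text, min_gram=2, max_gram=10):
    """ngram_tokenizer with letter/digit token_chars: every substring of every word."""
    grams = set()
    for word in text.split():
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.add(word[start:start + size])
    return grams

docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]
for doc in docs:
    signals = {
        "name.edge": "HI" in edge_grams_whole_text(doc),
        "name.ngram": "HI" in ngrams_per_word(doc),
    }
    print(doc, signals)
# HITS FIND SOME {'name.edge': True, 'name.ngram': True}
# TRENDING HI {'name.edge': False, 'name.ngram': True}
# HITS OTHER {'name.edge': True, 'name.ngram': True}
```

All three documents produce the name.ngram signal, but only the two that begin with HI also produce the boosted name.edge signal, because the edge grams are taken from the start of the whole text rather than from each word.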

Resources