We have some fields containing article numbers. An article number looks like AB 987 g567 323. When I search for "AB 987 g" I find the right product, but when I search without whitespace I don't find anything. I tried pattern_replace, but it doesn't work:
"whitespace_filter": {
"alphabets_char_filter": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
How can I search for article numbers both with and without whitespace?
You need to use an edge_ngram tokenizer along with a char_filter to achieve your use case. Here is a working example.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"char_filter": [
"replace_whitespace"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
},
"char_filter": {
"replace_whitespace": {
"type": "mapping",
"mappings": [
"\\u0020=>"
]
}
}
}
},
"mappings": {
"properties": {
"articlenumbers": {
"type": "text",
"fields": {
"analyzed": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
Index Data:
{
"articlenumbers": "AB 987 g567 323"
}
Search Query:
{
"query": {
"multi_match": {
"query": "AB987g",
"fields": [
"articlenumbers",
"articlenumbers.analyzed"
]
}
}
}
Search Result:
"hits": [
{
"_index": "65936531",
"_type": "_doc",
"_id": "1",
"_score": 1.4384104,
"_source": {
"articlenumbers": "AB 987 g567 323"
}
}
]
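To verify what actually gets indexed, you can run the analyzer through the _analyze API (the index name 65936531 is taken from the result above). The char_filter strips the spaces before the tokenizer runs, so the edge grams are built from "AB987g567323" and should include the whitespace-free prefix "AB987g":
POST /65936531/_analyze
{
  "analyzer": "my_analyzer",
  "text": "AB 987 g567 323"
}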
Related
Currently, I am using an ngram tokenizer to do partial matching of employees. I can match on FullName, Email address and Employee Number. My current setup looks as follows:
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
The problem I am facing is that an Employee Number can be 1 character long, and because of min_gram and max_gram I can never match it. I can't set min_gram to 1 either, because then the results do not look correct.
So I tried to mix the ngram tokenizer with a standard tokenizer, and instead of a multi_match search I am doing a simple_query_string query.
This also seems to work only partially.
My question is how I can partially match on all 3 fields, bearing in mind that an employee number can be 1 or 2 characters long, and how to get an exact match when I put quotes around a word or number.
In the example below, how can I search for 11 and return documents 4 and 5?
Also, I would like document 2 to be returned if I search for 706, which is a partial match, but if I search for "7061" only document 2 should be returned.
Full Code
PUT index
{
"settings": {
"analysis": {
"analyzer": {
"english_exact": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
},
"normalizer": {
"lowersort": {
"type": "custom",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"number": {
"type": "text",
"analyzer": "english",
"fields": {
"exact": {
"type": "text",
"analyzer": "english_exact"
}
}
},
"fullName": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
},
"analyzer": "standard"
}
}
}
}
PUT index/_doc/1
{
"number" : 1,
"fullName": "Brenda eaton"
}
PUT index/_doc/2
{
"number" : 7061,
"fullName": "Bruce wayne"
}
PUT index/_doc/3
{
"number" : 23,
"fullName": "Bruce Banner"
}
PUT index/_doc/4
{
"number" : 111,
"fullName": "Cat woman"
}
PUT index/_doc/5
{
"number" : 1112,
"fullName": "0723568521"
}
GET index/_search
{
"query": {
"simple_query_string": {
"fields": [ "fullName.ngram", "number.exact"],
"query": "11"
}
}
}
You need to change the analyzer of the number.exact field and reduce the min_gram count to 2. Modify the index mapping as shown below.
Adding a working example
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"english_exact": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"my_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
},
"normalizer": {
"lowersort": {
"type": "custom",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"number": {
"type": "keyword", // note this
"fields": {
"exact": {
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"fullName": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
},
"analyzer": "standard"
}
}
}
}
Search Query:
{
"query": {
"simple_query_string": {
"fields": [ "fullName.ngram", "number.exact"],
"query": "11"
}
}
}
Search Result:
"hits": [
{
"_index": "66311552",
"_type": "_doc",
"_id": "4",
"_score": 0.9929736,
"_source": {
"number": 111,
"fullName": "Cat woman"
}
},
{
"_index": "66311552",
"_type": "_doc",
"_id": "5",
"_score": 0.8505551,
"_source": {
"number": 1112,
"fullName": "0723568521"
}
}
]
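You can confirm the fix with the _analyze API (the index name 66311552 is taken from the result above): with min_gram reduced to 2, the grams produced for 1112 should now include the query term 11:
POST /66311552/_analyze
{
  "analyzer": "my_analyzer",
  "text": "1112"
}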
Update 1:
If you just need to search for 1, modify the data type of the number field from text type to keyword type, as shown in the index mapping above.
Search Query:
{
"query": {
"simple_query_string": {
"fields": [ "fullName.ngram", "number.exact","number"],
"query": "1"
}
}
}
Search Result will be
"hits": [
{
"_index": "66311552",
"_type": "_doc",
"_id": "1",
"_score": 1.3862942,
"_source": {
"number": 1,
"fullName": "Brenda eaton"
}
}
]
Update 2:
You can use two separate analyzers with ngram tokenizers for the fullName field and the number field. Modify the index mapping as shown below:
{
"settings": {
"analysis": {
"analyzer": {
"english_exact": {
"tokenizer": "standard",
"filter": [
"lowercase"
]
},
"name_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "name_tokenizer"
},
"number_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"tokenizer": "number_tokenizer"
}
},
"tokenizer": {
"name_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
},
"number_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
},
"normalizer": {
"lowersort": {
"type": "custom",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"number": {
"type": "keyword",
"fields": {
"exact": {
"type": "text",
"analyzer": "number_analyzer"
}
}
},
"fullName": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "name_analyzer"
}
},
"analyzer": "standard"
}
}
}
}
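For the quoted-exact-match requirement, one option (a sketch, not covered by the mapping above) is to rely on simple_query_string's phrase syntax against the keyword number field: a quoted query is executed as a phrase, and a keyword field indexes the whole value as a single term, so "7061" should match only document 2, while an unquoted 706 still partial-matches via number.exact:
GET 66311552/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "number" ],
      "query": "\"7061\""
    }
  }
}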
I have indexed terms that are 2-4 characters long with no spaces, but users often search for the "full term", which I don't have indexed and which has 3 extra characters after a blank space.
Ex: I index "A1" or "A1B" or "A1B2", and the "full term" is something like "A1 11A" or "A1B ABA" or "A1B2 2C8".
This is the current mapping:
"code": {
"type": "text"
},
If the user searches for "A1" it brings back all of them, which is correct; if he types "A1B" I want to bring back only the last two, and if he searches for "A1B2 2C8" I want to bring back only the last one.
Is that possible? If so, what would be the best search/index strategy?
Index Mapping:
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"code": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index data:
{
"code": "A1"
}
{
"code": "A1B"
}
{
"code": "A1B2"
}
Search Query:
{
"query": {
"match": {
"code": {
"query": "A1B2 2C8"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65067196",
"_type": "_doc",
"_id": "3",
"_score": 1.3486402,
"_source": {
"code": "A1B2"
}
}
]
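To see why only the last document matches, you can inspect the tokens (the index name 65067196 is taken from the result above). The autocomplete analyzer indexes "A1B2" as the grams a, a1, a1b, a1b2, while the standard search_analyzer leaves the query "A1B2 2C8" as the two tokens a1b2 and 2c8, of which only a1b2 finds a match:
POST /65067196/_analyze
{
  "analyzer": "autocomplete",
  "text": "A1B2"
}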
I have implemented auto-suggest using Elasticsearch, where I give suggestions to users based on a typed value such as 'where'. Most of it works fine if I type the full word or a few starting characters of a word. I want to highlight only the specific characters the user typed: say the user types 'ca', then the suggestion should highlight only the 'Ca' in 'California', not the whole word.
The highlight tags should produce <b>Ca</b>lifornia and not <b>California</b>.
Here are my index settings:
{
"settings": {
"index": {
"analysis": {
"filter": {
"edge_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 50
},
"lowercase_filter":{
"type":"lowercase",
"language": "greek"
},
"metro_synonym": {
"type": "synonym",
"synonyms_path": "metro_synonyms.txt"
},
"profession_specialty_synonym": {
"type": "synonym",
"synonyms_path": "profession_specialty_synonyms.txt"
}
},
"analyzer": {
"auto_suggest_analyzer": {
"filter": [
"lowercase",
"edge_filter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"auto_suggest_search_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "whitespace"
},
"lowercase": {
"filter": [
"trim",
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"properties": {
"what_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
},
"company": {
"type": "text",
"analyzer": "lowercase"
},
"where_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
},
"tags_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
}
}
}
}
The query I am using to pull suggestions:
GET /autosuggest_index_test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"where_auto_suggest": {
"query": "ca",
"operator": "and"
}
}
}
]
}
},
"aggs": {
"NAME": {
"terms": {
"field": "where_auto_suggest.raw",
"size": 10
}
}
},
"highlight": {
"pre_tags": [
"<b>"
],
"post_tags": [
"</b>"
],
"fields": {
"where_auto_suggest": {
}
}
}
}
One of the JSON results I am getting:
{
"_index" : "autosuggest_index_test",
"_type" : "_doc",
"_id" : "Calabasas CA",
"_score" : 5.755663,
"_source" : {
"where_auto_suggest" : "Calabasas CA"
},
"highlight" : {
"where_auto_suggest" : [
"<b>Calabasas</b> <b>CA</b>"
]
}
}
Can someone please suggest how to get output here (in where_auto_suggest) like "<b>Ca</b>labasas <b>CA</b>"?
I don't really know why, but if you use an edge_ngram tokenizer instead of an edge_ngram filter you will get highlighted characters instead of highlighted words (presumably because the tokenizer emits each gram with its own offsets, while the filter keeps the offsets of the original whole token).
So in your settings, you could declare such a tokenizer:
"settings": {
"index": {
"analysis": {
"tokenizer": {
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 50,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
...
}
}
}
And change your analyzer to:
"analyzer": {
"auto_suggest_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "edge_tokenizer"
}
...
}
Thus your example request will return
{
...
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "autosuggest_index_test",
"_type": "_doc",
"_id": "grIzo28BY9R4-IxJhcFv",
"_score": 0.2876821,
"_source": {
"where_auto_suggest": "california"
},
"highlight": {
"where_auto_suggest": [
"<b>ca</b>lifornia"
]
}
}
]
}
...
}
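A quick way to see the difference is to compare the token offsets reported by _analyze; the highlighter places its tags based on those offsets, and with the tokenizer each gram should carry its own start_offset and end_offset, whereas the edge_ngram filter keeps the offsets of the whole original token:
POST /autosuggest_index_test/_analyze
{
  "tokenizer": "edge_tokenizer",
  "text": "california"
}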
Elasticsearch query: changing the displayed results according to the scoring
The current query returns results on the title field in the following order:
1. Quick 123
2. Foxes Quick
3. Quick
4. Foxes Quick Quick
5. Quick Foxes
Shouldn't 3. Quick come as the first result instead?
Also, Foxes Quick Quick has two occurrences of Quick, so it should get some preference in the query result, but it comes in 4th position.
Index Settings:
{
"fundraisers": {
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "fundraisers",
"creation_date": "1546515635025",
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "my_tokenizer"
},
"search_analyzer_search": {
"filter": [
"lowercase"
],
"tokenizer": "search_tokenizer_search"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "edge_ngram",
"max_gram": "50"
},
"search_tokenizer_search": {
"token_chars": [
"letter",
"digit",
"whitespace"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "50"
}
}
},
"number_of_replicas": "1",
"uuid": "mVweO4_sT3Ww00MzdLyavw",
"version": {
"created": "6020399"
}
}
}
}
}
Query
GET fundraisers/_search?explain=true
{
"query": {
"match_phrase": {
"title": {
"query": "qui",
"analyzer": "my_analyzer"
}
}
}
}
Mapping
{
"fundraisers": {
"mappings": {
"fundraisers": {
"properties": {
"status": {
"type": "text"
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
},
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"twitterUrl": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"videoLinks": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"zipCode": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Am I overcomplicating this by using match_phrase, a search analyzer and ngrams, or is there a simpler way to achieve the expected result?
Ref:
https://www.elastic.co/guide/en/elasticsearch/reference/6.5/query-dsl-match-query.html
Ok, first let's create a minimal and reproducible setup:
PUT test
{
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "1",
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "my_tokenizer"
},
"search_analyzer_search": {
"filter": [
"lowercase"
],
"tokenizer": "search_tokenizer_search"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "edge_ngram",
"max_gram": "50"
},
"search_tokenizer_search": {
"token_chars": [
"letter",
"digit",
"whitespace"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "50"
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
PUT test/_doc/1
{
"title": "Quick 123"
}
PUT test/_doc/2
{
"title": "Foxes Quick"
}
PUT test/_doc/3
{
"title": "Quick"
}
PUT test/_doc/4
{
"title": "Foxes Quick Quick"
}
PUT test/_doc/5
{
"title": "Quick Foxes"
}
Then let's try the simplest query:
GET test/_search
{
"query": {
"match": {
"title": {
"query": "qui"
}
}
}
}
And now your order is:
Quick
Foxes Quick Quick
Quick 123
Foxes Quick
Quick Foxes
That's pretty much what you were expecting, right? There might be other use cases that are not covered by this query, but IMO you'll have to use multi_match and search on different analyzers, because I'm not sure a phrase search on an edge-gram field makes much sense.
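A sketch of what that multi_match could look like (the title.std subfield is an assumption, not in the mapping above; it would be a plain standard-analyzed subfield of title, boosted so that whole-token matches outrank edge-gram prefix matches):
GET test/_search
{
  "query": {
    "multi_match": {
      "query": "qui",
      "fields": [
        "title",
        "title.std^2"
      ]
    }
  }
}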
I am trying to build a suggestion feature with Elasticsearch, following this article: https://qbox.io/blog/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
What I have now works, but not for two words in the same sentence.
The data I currently have in ES:
{
"_index": "books",
"_type": "book",
"_id": "AVJp8p4ZTfM-Ee45GnF5",
"_score": 1,
"_source": {
"title": "Making a dish",
"author": "Jim haunter"
}
},
{
"_index": "books",
"_type": "book",
"_id": "AVJp8jaZTfM-Ee45GnF4",
"_score": 1,
"_source": {
"title": "The big fish",
"author": "Jane Stewart"
}
},
{
"_index": "books",
"_type": "book",
"_id": "AVJp8clRTfM-Ee45GnF3",
"_score": 1,
"_source": {
"title": "The Hunter",
"author": "Jame Franco"
}
}
Here are the mapping and settings:
{"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"books": {
"_all": {
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"title": {
"type": "string",
"index": "no"
},
"author": {
"type": "string",
"index": "no"
}
}
}
}
}
Here is the search:
{
"size": 10,
"query": {
"match": {
"_all": {
"query": "Hunter",
"operator": "and",
"fuzziness": 1
}
}
}
}
When I search for "The" I get "The big fish" and "The Hunter". However, when I enter "The Hunt" I get nothing; to get the book again I need to enter "The Hunte".
Any suggestions? Any help is appreciated.
Removing "index": "no" from the fields worked for me. Also, since I'm using ES 2.x, I had to replace "index_analyzer" with "analyzer". So here is the mapping:
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"books": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"title": {
"type": "string"
},
"author": {
"type": "string"
}
}
}
}
}
Here's some code I used to test it:
http://sense.qbox.io/gist/0140ee0f5043f66e76cc3109a18d573c1d09280b
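In case the gist link goes stale, here is a minimal reproduction along the same lines (assuming ES 2.x, where the _all field is still available):
PUT /test_index/books/1
{
  "title": "The Hunter",
  "author": "Jame Franco"
}

GET /test_index/_search
{
  "size": 10,
  "query": {
    "match": {
      "_all": {
        "query": "The Hunt",
        "operator": "and",
        "fuzziness": 1
      }
    }
  }
}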