Elasticsearch is not generating numeric tokens

I am having trouble getting Elasticsearch to generate proper tokens for phrases such as 15 pound chocolate cake. When running a fielddata_fields query on that field, it produces results along the lines of:
pou
poun
pound
cho
choc
choco
chocol
chocola
chocolat
chocolate
cak
cake
I don't see the numbers in there at all. I have tried several different combinations of analyzer options, to no avail. Below are my mappings:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "nGram_filter": {
            "type": "edge_ngram",
            "min_gram": 3,
            "max_gram": 20
          },
          "my_word": {
            "type": "word_delimiter",
            "preserve_original": "true"
          }
        },
        "analyzer": {
          "nGram_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase",
              "asciifolding",
              "my_word",
              "nGram_filter"
            ]
          },
          "whitespace_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "asciifolding"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "categories": {
      "properties": {
        "id": {"type": "text"},
        "sort": {"type": "long"},
        "search_term": {
          "type": "text",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer",
          "fielddata": true
        }
      }
    }
  }
}
I have tried an nGram filter like:
"nGram_filter": {
  "type": "edge_ngram",
  "min_gram": 3,
  "max_gram": 20,
  "token_chars": [
    "letter",
    "digit",
    "punctuation",
    "symbol"
  ]
}
Setting "generate_number_parts": "true" and "generate_word_parts": true on the word_delimiter filter did not help either.
EDIT
I got it working by changing min_gram to 2, but I was hoping to keep it at 3. Is there a way to maintain a gram size of 3 but also keep the numbers as they are?

The behavior is as expected. It is not an issue with numeric tokens but with term length. Even if you had a string of 1 or 2 characters, it would have been filtered out as well.
min_gram: Minimum length of characters in a gram. Defaults to 1.
Any token with fewer characters than min_gram is filtered out.
Hence, 15 is getting filtered out in this case.
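If you want to keep min_gram at 3 and still index short numeric tokens like 15, one option is to emit the original token alongside the grams. This is only a sketch, assuming your Elasticsearch version supports the preserve_original option on the edge_ngram token filter (recent versions do; check the docs for yours):
"nGram_filter": {
  "type": "edge_ngram",
  "min_gram": 3,
  "max_gram": 20,
  "preserve_original": true
}
You can verify exactly what the analyzer emits with the _analyze API (the index name here is a placeholder):
POST /my_index/_analyze
{
  "analyzer": "nGram_analyzer",
  "text": "15 pound chocolate cake"
}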

Related

Elasticsearch sort by text field keyword

I have an index with these settings
"analysis": {
  "filter": {
    "autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 10
    }
  },
  "analyzer": {
    "autocomplete": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "autocomplete_filter",
        "asciifolding",
        "elision",
        "standard"
      ]
    },
    "autocomplete_search": {
      "tokenizer": "lowercase"
    }
  },
  "tokenizer": {
    "autocomplete": {
      "type": "edge_ngram",
      "min_gram": "3",
      "max_gram": "32"
    }
  }
}
and this mapping for the name field
"name": {
  "type": "text",
  "analyzer": "autocomplete",
  "search_analyzer": "autocomplete_search",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}
Now I have several examples of names in documents. name is a single field that contains both first name and last name:
--макс---
-макс -
{something} макс
макс {something}
I am using this query to find documents with that name, with alphabetical sorting:
{
  "query": {
    "match": {
      "name": {
        "query": "макс",
        "operator": "and"
      }
    }
  },
  "sort": [
    {"name.keyword": "asc"}
  ]
}
It returns the results in the order I wrote them above, but I expect макс {something} to come first, before the others, because it starts with the query string I typed.
Can somebody help me here?
By default the query scores documents based on how well they matched, and this score is used to rank the best matches first. But as soon as you define a sort, you are telling Elasticsearch to ignore the query score and use only that field to rank the results. The results are still restricted to documents matching the query, but the idea of "best match" is lost unless you keep the special value _score somewhere in your sort statement.
Like this:
"sort": [
  {
    "productLine.keyword": {
      "order": "desc"
    }
  },
  {
    "_score": {
      "order": "desc"
    }
  }
]
Maybe you can just remove the sort and get the results you want based on the default score sorting. Include a few example documents to make this fully reproducible if you want more support from the SO community.
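Applied to the query from the question, a sketch that keeps relevance as the primary sort and only falls back to the keyword field for ties could look like this (field names are the ones from the question):
{
  "query": {
    "match": {
      "name": {
        "query": "макс",
        "operator": "and"
      }
    }
  },
  "sort": [
    {"_score": {"order": "desc"}},
    {"name.keyword": "asc"}
  ]
}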

Elasticsearch : Search with special character Open & Close parentheses

Hi, I am trying to search for a word that contains the characters '(' and ')' in Elasticsearch, and I am not getting the expected results.
This is the query I am using:
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "\\(Pas\\)ta\""
    }
  }
}
In the results I am getting records with "PASTORS", "PAST", "PASCAL", "PASSION" first. I want the name 'Pizza & (Pas)ta' to be the first record in the search results, as it is the best match.
Here is the analyzer for the name field in the schema:
"analysis": {
  "filter": {
    "autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": "1",
      "max_gram": "20"
    }
  },
  "analyzer": {
    "autocomplete": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "autocomplete_filter"
      ]
    }
  }
}
and the mapping for the field:
"name": {
  "analyzer": "autocomplete",
  "search_analyzer": "standard",
  "type": "string"
},
Please help me fix this. Thanks!
You have used the standard tokenizer, which removes ( and ) from the generated tokens. Instead of the token (pas)ta, one of the tokens generated is pasta, and hence you are not getting a match for (pas)ta.
Instead of the standard tokenizer you can use the whitespace tokenizer, which retains the special characters in the name. Change the analyzer definition to the one below:
"analyzer": {
  "autocomplete": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": [
      "lowercase",
      "autocomplete_filter"
    ]
  }
}
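To see the difference directly, you can compare the tokenizers with the _analyze API (the sample text is just illustrative):
POST /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "Pizza & (Pas)ta"
}
With the whitespace tokenizer this yields pizza, & and (pas)ta, whereas the standard tokenizer would drop the & and split (Pas)ta into pas and ta.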

Elasticsearch typeahead query optimization

I am currently working on typeahead support (with contains, not just starts-with) for over 100,000,000 entries (and that number could grow arbitrarily) using Elasticsearch.
The current setup works, but I was wondering if there is a better approach to it.
I'm using AWS Elasticsearch, so I don't have full control over the cluster.
My index is defined as follows:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        },
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": "lowercase"
        },
        "search_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 300,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation",
            "whitespace"
          ]
        },
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 300,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation",
            "whitespace"
          ]
        }
      }
    }
  },
  "mappings": {
    "account": {
      "properties": {
        "tags": {
          "type": "text",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "search_analyzer"
        },
        "tags_prefix": {
          "type": "text",
          "analyzer": "edge_ngram_analyzer",
          "search_analyzer": "search_analyzer"
        },
        "tenantId": {
          "type": "text",
          "analyzer": "keyword"
        },
        "referenceId": {
          "type": "text",
          "analyzer": "keyword"
        }
      }
    }
  }
}
The structure of the documents is:
{
  "tenantId": "1234",
  "name": "A NAME",
  "referenceId": "1234567",
  "tags": [
    "1234567",
    "A NAME"
  ],
  "tags_prefix": [
    "1234567",
    "A NAME"
  ]
}
The point behind the structure is that documents have dedicated searchable fields over which typeahead works; it does not work over everything in the document, and the searchable values can even include things that are not in the document itself.
The search query is:
{
  "from": 0,
  "size": 10,
  "highlight": {
    "fields": {
      "tags": {}
    }
  },
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "a nam",
          "fields": ["tags_prefix^100", "tags"]
        }
      },
      "filter": {
        "term": {
          "tenantId": "1234"
        }
      }
    }
  }
}
I'm doing a multi_match because, while I need typeahead, results that have the match at the start need to come back first, so I followed the recommendation given here.
The current setup is 10 shards, 3 master nodes (t2.mediums), 2 data/ingestion nodes (t2.mediums) with 35GB EBS disk on each, which I know is tiny given the final needs of the system, but useful enough for experimenting.
I have ~6,000,000 records inserted, and the response time with a cold cache is around 300ms.
I was wondering if this is the right approach or are there some optimizations I could implement to the index/query to make this more performant?
First, I think the solution you have built is good, and the optimisations you are looking for should only be considered if you have an issue with the current solution, meaning the queries are too slow. No need for premature optimisation.
Second, I think you don't need to provide tags_prefix in your docs. All you need is to apply the edge_ngram_tokenizer to the tags field, which will create the desired prefix tokens for the search to work. You can use multi-fields in order to have multiple tokenizers for the same tags field (see the sketch below).
Third, use the edge_ngram_tokenizer settings carefully, especially the min_gram and max_gram settings. The reason is that too high a max_gram will:
a. create too many prefix tokens and use too much space
b. decrease the indexing rate, as indexing takes longer
c. not be useful - you don't expect autocomplete to take 300 prefix characters into account. A better maximum prefix token setting should be (in my opinion) in the range of 10-20 characters max (or even less).
Good luck!
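A minimal sketch of the multi-field idea from the second point, reusing the analyzers defined in the question (the sub-field name prefix is just illustrative):
"tags": {
  "type": "text",
  "analyzer": "ngram_analyzer",
  "search_analyzer": "search_analyzer",
  "fields": {
    "prefix": {
      "type": "text",
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "search_analyzer"
    }
  }
}
The multi_match would then boost tags.prefix^100 instead of tags_prefix^100, and the documents would no longer need to carry a duplicated tags_prefix array.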

Elasticsearch - get results for autocomplete only for start of words

I'm using Elasticsearch to give autosuggestions in a search bar, but I want it to match only the beginning of words. E.g.
doc_name_1 = "black bag"
doc_name_2 = "abla bag"
Case 1.
The search bar string is part_string = "bla", and the query I'm currently using is:
query_body = {
    "query": {
        "match": {
            "_all": {
                "query": part_string,
                "operator": "and",
                "type": "phrase_prefix"
            }
        }
    }
}
This query returns hits on both doc_name_1 and doc_name_2.
What I need is to get a hit only on doc_name_1, since doc_name_2 does not start the same way as the queried string.
I tried using "type": "phrase", but ES keeps matching "inside" the words in the docs. Is it possible to do that just by modifying the query, or the settings?
I'll share my ES settings:
{
  "analysis": {
    "filter": {
      "nGram_filter": {
        "type": "ngram",
        "min_gram": 1,
        "max_gram": 20,
        "token_chars": [
          "letter",
          "digit",
          "punctuation",
          "symbol"
        ]
      }
    },
    "analyzer": {
      "nGram_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
          "lowercase",
          "asciifolding",
          "nGram_filter"
        ]
      },
      "whitespace_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
          "lowercase",
          "asciifolding"
        ]
      }
    }
  }
}
Use edge_ngram instead of ngram. With ngram you are breaking the text up at every position of the word and filling the inverted index with tokens from the middle of words, so lookups match inside the words as well.
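A minimal sketch of that change, keeping the min_gram and max_gram values from the settings in the question and only swapping the filter type (token_chars is left out here, as it is documented for the edge_ngram tokenizer rather than the token filter):
"nGram_filter": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 20
}
With edge grams, abla only produces a, ab, abl and abla, so a search for bla no longer matches it, while black still matches because bla is one of its edge grams.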

How can I get results that don't fully match using ElasticSearch?

If a user types
jewelr
I want to get results for
jewelry
I am using a multi_match query.
You could use EdgeNGram tokenizer:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/edgengram-tokenizer/
Specify an index-time analyzer using this:
"analysis": {
  "filter": {
    "fulltext_ngrams": {
      "side": "front",
      "max_gram": 15,
      "min_gram": 3,
      "type": "edgeNGram"
    }
  },
  "analyzer": {
    "fulltext_index": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "standard",
        "lowercase",
        "asciifolding",
        "fulltext_ngrams"
      ]
    }
  }
}
Then either set it as the default index analyzer, or specify it for a specific field mapping (see the example below).
When indexing a field with value jewelry, with a 3/15 EdgeNGram, all combinations will be stored:
jew
jewe
jewel
jewelr
jewelry
Then a search for jewelr will get a match in that document.
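For example, a field mapping that applies it at index time only might look like this (the field name is illustrative; on the old versions this answer targets the type is string, on modern versions use text):
"name": {
  "type": "string",
  "analyzer": "fulltext_index",
  "search_analyzer": "standard"
}
A query-time search for jewelr is then analyzed with the plain standard analyzer and matches the stored jewelr edge gram.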
