Elasticsearch ngram tokenizer not working

Below is a shortened version of my configuration and mapping, using the ngram tokenizer (I have not included all fields; only region is shown below).
I have region data indexed as "Stuttgart" and "Munich" for two separate documents.
When I search for the text "Stut" or "tutt", no documents are returned.
Is there anything I am missing in the config?
{
"mappings": {
"address": {
"properties": {
"region": {
"type": "text",
"analyzer": "address_analyzer"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"address_analyzer": {
"tokenizer": "address_tokenizer"
}
},
"tokenizer": {
"address_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
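For reference, the analyzer output can be inspected with the _analyze API. A minimal sketch (the index name my_address_index is a placeholder, since the question does not show one):

# my_address_index is a placeholder for your actual index name
POST my_address_index/_analyze
{
  "analyzer": "address_analyzer",
  "text": "Stuttgart"
}

With min_gram and max_gram both set to 4, this should return the tokens Stut, tutt, uttg, ttga, tgar and gart, so a match query for "Stut" or "tutt" would be expected to hit as long as the same analyzer is applied to the query.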

Related

How to match partial words in Elasticsearch text search

I have a field name in my Elasticsearch index with a value of Single V.
Now if I search it with a value of S or Sing, I don't get any result, but if I enter the full value Single, then I get the result Single V. The query I am using is as follows:
{
"query": {
"match": {
"name": "singl"
}
},
"sort": []
}
This gives me no results. Do I need to change the mapping/settings for name or the analyzer?
EDIT:
I am trying to create the following index with the following mapping/settings
PUT my_cars
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": ["lowercase"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
But I get the following error:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
}
],
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
},
"status" : 400
}
By default, Elasticsearch uses the standard analyzer for a text field if no analyzer is specified. This tokenizes "Single V" into "single" and "v". Because of this, you get a result for "Single" but not for the other terms.
If you want to do a partial search, you can use the edge n-gram tokenizer or a wildcard query.
The mapping for the edge n-gram tokenizer would be:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 6,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
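With this mapping in place, a partial term such as Sing should match the document containing Single V. A quick sketch, reusing the my_cars index name from the question:

# assumes the index was created as my_cars with the mapping above
GET my_cars/_search
{
  "query": {
    "match": {
      "name": "Sing"
    }
  }
}

The edge n-gram tokenizer indexes the prefixes Si, Sin, Sing, Singl and Single for the name field, so the query term overlaps with the indexed tokens.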
Update 1:
In the index mapping given above, one closing bracket } is missing, so the tokenizer block ends up nested inside the analyzer block. Modify your index mapping as shown below:
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": [
"lowercase"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
}, // note this
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
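After recreating the index, you can confirm what the corrected analyzer emits. A quick check, again using the my_cars index from the question:

# assumes the corrected mapping above was applied to my_cars
POST my_cars/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Single V"
}

Because min_gram is 1 and token_chars is limited to letters, this emits every substring of Single plus V on its own, so the earlier match query for singl now overlaps with the indexed terms and returns the document.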
This is because of the default analyzer. The field is broken into the tokens [Single, V] by the analyzer.
A match query will try to find an exact match for any of the query tokens. Since you are only passing Singl, that will be the only token, and it does not match either of the two tokens stored in the index.
You can also use a wildcard query:
{
"query": {
"wildcard": {
"name": {
"value": "*singl*"
}
}
}
}

elasticsearch ngram backslash escape

I am using an ngram analyzer on my Elasticsearch index. This is needed for the search capability I require. I am searching for a document with a name of "l/test_V0001". When I search using "l/test" I only get results for "l"; the / is being treated as an escape character rather than as text. I have searched and found that this is a common and expected issue, but I can find no workaround.
When I search the API for "l/test_V0001" I can find the result I am after. However, when doing the same search via the Java API I still only get results for "l".
Here is the API search:
{
"query": {
"multi_match": {
"query": "l/test_V0001",
"fields": ["name", "name.partial", "name.text"]
}
}
}
and the mapping for the index:
{
"settings": {
"index": {
"max_ngram_diff": 20,
"search.idle.after": "10m"
},
"analysis": {
"analyzer": {
"ngram3_analyzer": {
"tokenizer": "ngram3_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"ngram3_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20
}
}
}
},
"mappings": {
"dynamic": "strict",
"properties": {
"name": {
"type": "keyword",
"fields": {
"partial": {
"type": "text",
"analyzer": "ngram3_analyzer",
"search_analyzer": "keyword"
},
"text": {
"type": "text"
}
}
},
"value": {
"type": "integer"
}
}
}
}
Any help on this or a workaround would be great!
So after a bit of digging I found the answer: custom token characters. This has been added to the index mapping:
"tokenizer": {
"ngram3_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"symbol",
"custom"
],
"custom_token_chars": "/"
}
}
So my full index now looks like:
{
"settings": {
"index": {
"max_ngram_diff": 20,
"search.idle.after": "10m"
},
"analysis": {
"analyzer": {
"ngram3_analyzer": {
"tokenizer": "ngram3_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"ngram3_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"symbol",
"custom"
],
"custom_token_chars": "/"
}
}
}
},
"mappings": {
"dynamic": "strict",
"properties": {
"name": {
"type": "keyword",
"fields": {
"partial": {
"type": "text",
"analyzer": "ngram3_analyzer",
"search_analyzer": "keyword"
},
"text": {
"type": "text"
}
}
},
"value": {
"type": "integer"
}
}
}
}
This works for both the REST client and the Java API.
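To confirm the fix, the analyzer can be run against the problem value. A small sketch (the index name my_index is a placeholder, as the question does not show one):

# my_index is a placeholder for the actual index name
POST my_index/_analyze
{
  "analyzer": "ngram3_analyzer",
  "text": "l/test_V0001"
}

With "custom_token_chars": "/" configured, the slash is kept inside the grams, so the output now contains tokens such as l/t and l/te instead of the value being split at the /.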

NEST query for Elasticsearch ngram filter and analyzer

I have created filters and analyzers on my Elasticsearch index. Can I use them directly, or do I need to provide the settings in my NEST query?
POST /myindextest/_close
PUT myindextest/_settings
{
"settings": {
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trigrams_filter"
]
}
}
}
},
"mappings": {
"myindextest": {
"properties": {
"name": {
"type": "text",
"analyzer": "trigrams"
}
}
}
}
}
POST /myindextest/_open
If I have to provide them in C# using NEST, how should the filters and analyzers be created and assigned to specific queries? Where can I find better documentation for Elasticsearch NEST?

Multiple tokenizers inside one Custom Analyser in Elasticsearch

I am using a custom NGRAM analyzer which has an ngram tokenizer. I have also used a lowercase filter. The query works fine for searches without symbols, but when I search for certain symbols, it fails. Since I have used a lowercased ngram tokenizer restricted to letters and digits, Elasticsearch doesn't index symbols. I know the whitespace tokenizer can help me solve the issue. How can I use two tokenizers in a single analyzer? Below is the mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer":"my_tokenizer",
"filter":"lowercase"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
Is there a way I could solve this issue?
As per the Elasticsearch documentation:
An analyzer must have exactly one tokenizer.
However, you can have multiple analyzers defined in the settings, and you can configure a separate analyzer for each field.
If you want a single field to be analyzed with different analyzers, one option is to make that field a multi-field, as per this link:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "whitespace"
"fields": {
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
So if you configure it as above, your query needs to make use of the title and title.ngram fields.
GET my_index/_search
{
"query": {
"multi_match": {
"query": "search ##$ whatever",
"fields": [
"title",
"title.ngram"
],
"type": "most_fields"
}
}
}
As another option, here is what you can do:
Create two indexes.
The first index has the field title with the analyzer my_analyzer.
The second index has the field title with the analyzer whitespace.
Have the same alias created for both of them, as below.
Execute the following:
POST _aliases
{
"actions":[
{
"add":{
"index":"index A",
"alias":"index"
}
},
{
"add":{
"index":"index B",
"alias":"index"
}
}
]
}
So when you eventually write a query, it must point to this alias, which in turn queries multiple indexes.
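For example, a search sent to the alias (a minimal sketch, using the alias name index from the _aliases call above) is fanned out to both indexes, each applying its own analyzer to the title field:

GET index/_search
{
  "query": {
    "match": {
      "title": "search ##$ whatever"
    }
  }
}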
Hope this helps!
If you want to use two tokenizers, you should have two analyzers.
Something like this:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer":"my_tokenizer",
"filter":"lowercase"
},
"my_analyzer_2": {
"tokenizer":"whitespace",
"filter":"lowercase"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
In general, you should also pay attention to where the analyzer is specified in the mapping.
Sometimes it is necessary to set the analyzer for both search time and index time.
"mappings":{
"_doc":{
"properties":{
"title":{
"type":"text",
"analyzer":"my_analyzer",
"search_analyzer":"my_analyzer"
}
}
}
}
1) You can try updating your token_chars as below:
"token_chars":[
"letter",
"digit",
"symbol",
"punctuation"
]
2) If that does not work, then try the analyzer below:
{
"settings":{
"analysis":{
"filter":{
"my_filter":{
"type":"ngram",
"min_gram":3,
"max_gram":3,
"token_chars":[
"letter",
"digit",
"symbol",
"punctuation"
]
}
},
"analyzer":{
"my_analyzer":{
"type":"custom",
"tokenizer":"keyword",
"filter":[
"lowercase",
"like_filter"
]
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"title":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
}
You need to use the keyword tokenizer and then an ngram token filter in your analyzer.
I found a solution for using a synonym filter and an edge_ngram filter at the same time.
The main logic works like this: it uses one analyzer at indexing time and another at search time.
PUT test_synonym_and_autocomplete
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
},
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"kstem",
"synonym"
]
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "synonym_analyzer"
}
}
}
}
POST test_synonym_and_autocomplete/_doc
{
"text": "Quick Brown Fox"
}
POST test_synonym_and_autocomplete/_doc
{
"text": "B3Round"
}
GET test_synonym_and_autocomplete/_search
{
"query": {
"match": {
"text": {
"query": "test"
}
}
}
}
GET test_synonym_and_autocomplete/_search
{
"query": {
"match": {
"text": {
"query": "qui"
}
}
}
}

Why does my Elasticsearch multi-match query look only for prefixes?

I am trying to write an Elasticsearch multi-match query (with the Java API) to create a "search-as-you-type" program. The query is applied to two fields, title and description, which are analyzed as ngrams.
My problem is, it seems that Elasticsearch tries to find only words beginning like my query. For instance, if I search for "nut", then it matches with documents featuring "nut", "nuts", "Nutella", etc, but it does not match documents featuring "walnut", which should be matched.
Here are my settings:
{
"index": {
"analysis": {
"analyzer": {
"edgeNGramAnalyzer": {
"tokenizer": "edgeTokenizer",
"filter": [
"word_delimiter",
"lowercase",
"unique"
]
}
},
"tokenizer": {
"edgeTokenizer": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "8",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
Here is the relevant part of my mapping:
{
"content": {
"properties": {
"title": {
"type": "text",
"analyzer": "edgeNGramAnalyzer",
"fields": {
"sort": {
"type": "keyword"
}
}
},
"description": {
"type": "text",
"analyzer": "edgeNGramAnalyzer",
"fields": {
"sort": {
"type": "keyword"
}
}
}
}
}
}
And here is my query:
new MultiMatchQueryBuilder(query).field("title", 3).field("description", 1).fuzziness(0).tieBreaker(1).minimumShouldMatch("100%")
Do you have any idea what I could be doing wrong?
That's because you're using an edgeNGram tokenizer instead of an nGram one. The former only indexes prefixes, while the latter indexes prefixes, suffixes and also sub-parts of your data.
Change your analyzer definition to this instead and it should work as expected:
{
"index": {
"analysis": {
"analyzer": {
"edgeNGramAnalyzer": {
"tokenizer": "edgeTokenizer",
"filter": [
"word_delimiter",
"lowercase",
"unique"
]
}
},
"tokenizer": {
"edgeTokenizer": {
"type": "nGram", <---- change this
"min_gram": "3",
"max_gram": "8",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
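To see the difference, compare what the two tokenizer types emit for a value such as walnut. A quick sketch (the index name my_index is a placeholder):

# my_index is a placeholder; run this after applying the changed settings
POST my_index/_analyze
{
  "analyzer": "edgeNGramAnalyzer",
  "text": "walnut"
}

With the original edgeNGram type the tokens are only the prefixes wal, waln, walnu and walnut, whereas with type nGram the output also contains inner grams such as aln, lnu and, crucially, nut, which is why a search for "nut" then matches.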
