custom tokenizer without using built-in token filters - elasticsearch

How do I create a custom tokenizer without using the default built-in token filters? e.g. text: "Samsung Galaxy S9"
I want to tokenize this text such that it is indexed like this:
["samsung", "galaxy", "s9", "samsung galaxy s9", "samsung s9", "samsung galaxy", "galaxy s9"].
How would I do that?

PUT testindex
{
"settings": {
"analysis": {
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 20,
"min_shingle_size": 2,
"output_unigrams": "true"
}
},
"analyzer": {
"analyzer_shingle": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"filter_shingle"
]
}
}
}
},
"mappings": {
"product": {
"properties": {
"title": {
"analyzer": "analyzer_shingle",
"search_analyzer": "standard",
"type": "text"
}
}
}
}
}
POST testindex/product/1
{
"title": "Samsung Galaxy S9"
}
GET testindex/_analyze
{
"analyzer": "analyzer_shingle",
"text": ["Samsung Galaxy S9"]
}
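For reference, the _analyze call above should return roughly the following tokens (a sketch of the response with positions and offsets omitted). Note that shingles only combine adjacent tokens, so every combination from the question is produced except the non-adjacent "samsung s9":
{
  "tokens": [
    { "token": "samsung" },
    { "token": "samsung galaxy" },
    { "token": "samsung galaxy s9" },
    { "token": "galaxy" },
    { "token": "galaxy s9" },
    { "token": "s9" }
  ]
}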
You can find more about shingles here and here.
The first example is great and covers a lot. If you want to use the standard tokenizer instead of whitespace, you'll have to take care of the stop words as the blog post describes. Both of the URLs are official ES sources.

Related

Configure highlighted part in the elasticsearch

Main question
The user is looking for a name and enters part of it, let's say au, and the document with the text paul is found.
I would like to have the doc highlighted like p<em>au</em>l.
How can I achieve it if I have a complex search query (combination of match, prefix, wildcard to rule relevance)?
Sub question
When do the highlight settings from the documentation for type, boundary_scanner and boundary_chars come into play? As per my tests described below, these settings don't change the highlighted part.
Try 1: Wildcard query with default analyzer
PUT myindex
{
"mappings": {
"properties": {
"name": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindex/_doc/1
{
"name": "paul"
}
GET myindex/_search
{
"query": {
"wildcard": {"name": "*au*"}
},
"highlight": {
"fields": {
"name": {}
},
"type": "fvh",
"boundary_scanner": "chars",
"boundary_chars": "abcdefghijklmnopqrstuvwxyz.,!? \t\n"
}
}
This kind of search returns highlight <em>paul</em> but I need to get p<em>au</em>l.
Try 2: Match query with NGRAM analyzer
This one works as described in SO question: Highlighting part of word in elasticsearch
PUT myindexngram
{
"settings": {
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "3",
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"index_ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
},
"search_term_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "index_ngram_analyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindexngram/_doc/1
{
"name": "paul"
}
GET myindexngram/_search
{
"query": {
"match": {"name": "au"}
},
"highlight": {
"fields": {
"name": {}
}
}
}
This highlights p<em>au</em>l as desired but:
Highlighting depends on the query type, so combining match and wildcard will again result in <em>paul</em>.
Highlighting is not affected at all by the type, boundary_scanner and boundary_chars settings.
Elastic version 7.13.4
Response from Elasticsearch team:
A highlighter works on terms, so only full terms can be highlighted - whatever the terms in your index are. In your second example, au could be highlighted because it is a term in the index, which is not the case in your first example.
There is also an option to define your own highlight_query that could be different from the main query, but this could lead to unpredictable highlights.
https://discuss.elastic.co/t/configure-highlighted-part/295164
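As a sketch of the highlight_query option mentioned above, you could keep an arbitrary main query and point the highlighter at the ngram-analyzed field from Try 2, e.g. against myindexngram (as the Elasticsearch team notes, this can produce unpredictable highlights):
GET myindexngram/_search
{
  "query": {
    "wildcard": { "name": "*au*" }
  },
  "highlight": {
    "fields": {
      "name": {
        "highlight_query": {
          "match": { "name": "au" }
        }
      }
    }
  }
}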

Elastic search partial substring search

I am trying to implement partial substring search in Elasticsearch 7.1 using the following analyzer:
PUT my_index-001
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "nGram",
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
After that I tried adding some sample data to my_index-001 with type doc:
PUT my_index-001/doc/1
{
"title": "ABBOT Series LTD 2014"
}
PUT my_index-001/doc/2
{
"title": "ABBOT PLO LTD 2014A"
}
PUT my_index-001/doc/3
{
"title": "ABBOT TXT"
}
PUT my_index-001/doc/4
{
"title": "ABBOT DMO LTD. 2016-II"
}
Query used to perform the partial search:
GET my_index-001/_search
{
"query": {
"match": {
"title": {
"query": "ABB",
"operator": "or"
}
}
}
}
I was expecting the following behaviour from the analyzer:
If I type in ABB I should get doc IDs 1, 2, 3, 4
If I type in ABB 2014 I should get doc IDs 1, 2
If I type in ABBO PLO I should get doc 2
If I type in TXT I should get doc 3
With the above analyzer settings I am not getting the expected results.
Please let me know if I am missing anything in my Elasticsearch analyzer settings.
You were almost there but there are a couple of issues.
When creating index mappings through Kibana Dev Tools, there mustn't be any whitespace between the URI and the request body. You have whitespace in the first code snippet which caused ES to ignore the request body entirely! So remove that whitespace.
The maximum ngram difference is set to 1 by default. In order to use your high ngram intervals, you'll need to explicitly increase the index-level setting max_ngram_diff:
PUT my_index-001
{
"settings": {
"index": {
"max_ngram_diff": 40 <--
},
...
}
}
Type names are deprecated in v7. So is the nGram token filter in favor of ngram (lowercase g). And so is the string field type too! Here's the corrected PUT request body:
PUT my_index-001 <--- no whitespace after the URI!
{
"settings": {
"index": {
"max_ngram_diff": 40 <--- explicit setting
},
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "ngram", <--- ngram, not nGram
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text", <--- text, not string
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
Since different mapping types have been deprecated in favor of the generic _doc type, you'll need to adjust the way you insert documents. The only difference, luckily, is changing doc to _doc in the URI:
PUT my_index-001/_doc/1
{ "title": "ABBOT Series LTD 2014" }
PUT my_index-001/_doc/2
{ "title": "ABBOT PLO LTD 2014A" }
PUT my_index-001/_doc/3
{ "title": "ABBOT TXT" }
PUT my_index-001/_doc/4
{ "title": "ABBOT DMO LTD. 2016-II" }
Finally, your query is perfectly fine and should behave the way you expect it to. The only thing to change is the operator to and when querying for two or more substrings, i.e.:
GET my_index-001/_search
{
"query": {
"match": {
"title": {
"query": "ABB 2014",
"operator": "and"
}
}
}
}
Other than that, all four of your test scenarios should return what you expect.
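If you want to verify what the corrected autocomplete analyzer emits at index time, a quick _analyze check (a sketch, run against the recreated index) should list the 2-40 character ngrams that make the partial matches possible:
GET my_index-001/_analyze
{
  "analyzer": "autocomplete",
  "text": "ABBOT"
}
Among the returned tokens you should see "ab", "abb", "abbo", "abbot", "bb", "bbo" and so on, which is why the search term ABB (analyzed by autocomplete_search into the single token abb) matches all four documents.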

Right way for user search by partial username or name using ngram tokenizer in elasticsearch

I want to create a search feature for a social networking application such that users can search other users by username or name, even by typing only part of the username or name, using Elasticsearch.
For example:
input: okma
result: {"username": "alokmahor", "name": "Alok Singh Mahor"} // partial match in username
input: m90
result: {"username": "ram9012", "name": "Ram Singh"} // partial match in username
input: shn
result: {"username": "r2020", "name": "Krishna Kumar"} // partial match with name
After reading and playing with these links I came up with my partial solution, which I am not sure is the correct way.
I followed:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
How to search for a part of a word with ElasticSearch
My solution is
DELETE my_index
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"username": { "type": "text", "analyzer": "my_analyzer" },
"name": { "type": "text", "analyzer": "my_analyzer" }
}
}
}
PUT /my_index/_doc/1
{
"username": "alokmahor",
"name": "Alok Singh Mahor"
}
PUT /my_index/_doc/2
{
"username": "ram9012",
"name": "Ram Singh"
}
PUT /my_index/_doc/3
{
"username": "r2020",
"name": "Krishna Kumar"
}
GET my_index/_search
{
"query": {
"multi_match": {
"query": "shn",
"analyzer": "my_analyzer",
"fields": ["username", "name"]
}
}
}
Somehow this solution is partially working, and I am not sure if it is really the correct way, as I arrived at it after playing around with Elasticsearch features and copy-pasting example code. So please suggest the correct way or improvements on it.
Things which are not working:
// "sin" does not match "Singh", but "Sin" matches and works.
GET my_index/_search
{
"query": {
"multi_match": {
"query": "sin",
"analyzer": "my_analyzer",
"fields": ["username", "name"]
}
}
}
So please suggest the correct way.
The degree of correctness can only be defined by your requirements. You can keep refining by checking all the possible use cases one by one.
As for "improvement on this":
For the problem you mention, where Sin matches while sin does not: this is because the analyzer you defined doesn't make the search case-insensitive. To fix that, add a lowercase filter to your analyzer definition as below:
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [
"lowercase"
]
}
}
This answer can help you understand more about case-insensitive search.
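To double-check the fix, assuming my_index has been recreated with the lowercase filter added to my_analyzer, the sketch below first confirms that "Singh" is now indexed as lowercase trigrams ("sin", "ing", "ngh") and then reruns the failing query:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Singh"
}
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "sin",
      "analyzer": "my_analyzer",
      "fields": ["username", "name"]
    }
  }
}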

Elastic Search - how to use language analyzer with UTF-8 filter?

I have a problem with an Elasticsearch language analyzer. I am working with the Lithuanian language, so I am using the Lithuanian language analyzer. The analyzer works fine and I get all the word cases I need. For example, I index the Lithuanian city "Klaipėda":
PUT /cities/city/1
{
"name": "Klaipėda"
}
The problem is that I also need to get a result when I search for "Klaipėda" using only the Latin alphabet ("Klaipeda") and in all Lithuanian cases:
Nominative case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" work, but "Klaipeda", "Klaipedos", "Klaipedoje" do not.
My index:
PUT /cities
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"fields": {
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"md_folded_analyzer": {
"type": "lithuanian",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
and search query:
GET /cities/_search
{
"query": {
"multi_match" : {
"type": "most_fields",
"query": "klaipeda",
"fields": [ "name", "name.folded" ]
}
}
}
What am I doing wrong? Thanks for the help.
The technique you are using here is so-called multi-fields. The limitation of the underlying name.folded field is that you can't perform a search against it - you can only perform sorting and aggregation on name.folded.
To work around this I've come up with the following set-up:
Separate fields set-up (to eliminate duplicates, just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"copy_to": "folded",
},
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}'
Change the type of your analyzer to custom as described here, because otherwise the asciifolding does not make it into the config. More importantly, asciifolding should go after all the Lithuanian stemming / stop-word filters, because after folding a word can lose its intended sense.
curl -XPUT http://localhost:9200/my_cities -d '
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"md_folded_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_stemmer",
"asciifolding"
]
}
}
}
}
}
Sorry, I've eliminated lithuanian_keywords - it requires additional set-up, which I've omitted here. But I hope you've got the idea.
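One more detail: since folded is now a top-level field populated via copy_to rather than a name.folded sub-field, the search query changes accordingly. A minimal sketch, assuming the two snippets above are merged into a single index creation request for the same index (cities):
GET /cities/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "klaipeda",
      "fields": [ "name", "folded" ]
    }
  }
}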

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an Elasticsearch index using a custom analyzer which uses the letter tokenizer and the lowercase and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't come back with any results. When I tried the full word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz, and tried to search by sub-words again, and it worked.
To try to understand what is going on, I thought I would check the terms generated for my documents using the _termvector service, and the result was identical for both the underscore-separated and the dash-separated sub-words, so I really expected the search results to be identical in both cases.
Any idea what I could be doing wrong?
If it helps, these are the settings I used for my index:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"cmt_value_analyzer": {
"tokenizer": "letter",
"filter": [
"lowercase",
"my_filter"
],
"type": "custom"
}
},
"filter": {
"my_filter": {
"type": "word_delimiter"
}
}
}
}
},
"mappings": {
"alertmodel": {
"properties": {
"name": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"productId": {
"type": "double"
},
"productName": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"link": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"updatedOn": {
"type": "date"
}
}
}
}
}
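One way to narrow this down is to compare what the custom analyzer emits for the two variants via the _analyze API. This is only a diagnostic sketch: myindex is a placeholder for whichever index these settings were applied to, and the request-body form shown assumes a reasonably recent Elasticsearch version (older releases that still accept string mappings take analyzer and text as query parameters instead):
GET myindex/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc_xyz"
}
GET myindex/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc-xyz"
}
If both calls return the same sub-word terms (abc and xyz), the index side matches the _termvector observation, and the difference in behaviour most likely comes from the search side - which field the query targets and how the query string is analyzed at search time.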
