Elasticsearch: how to map common character mistakes?

I would like to map common mistakes in my language, such as:
xampu -> shampoo
Shampoo is an English word, but it is commonly used in Brazil. In Portuguese, "ch" sounds like "x", just as "s" sometimes sounds like "z". We also do not have "y" in our language, but it is common in names and foreign words, where it sounds like "i".
So I would like to map a character replacement, but also keep the original word in the same position.
So a mapping table would be:
ch -> x
sh -> x
y -> i
ph -> f
s -> z
I have taken a look at the "Character Filters", but they seem to only support replacement:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
I want to form derivative words based on the original so users can find the correct word even if it is typed wrong. To achieve this, the following product name:
SHAMPOO NIVEA MEN
Should be tokenized as:
0: SHAMPOO, XAMPOO
1: NIVEA
2: MEN
I am using the synonym filter, but with synonyms I would need to map every word.
Is there any way to do this?
Thanks.

For your use case, multi-fields seem to fit best. You can have your field analyzed in two ways: one using the standard analyzer and the other using a custom analyzer built with a mapping char filter.
It would look like this:
Index Creation
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ch => x",
            "sh => x",
            "y => i",
            "ph => f",
            "s => z"
          ]
        }
      }
    }
  }
}
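To verify what the custom analyzer produces, you can run the product name through the _analyze API (a small sketch, not part of the original answer; note that the mapping char filter is case-sensitive, so the sample text is lowercased here):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "shampoo nivea men"
}
This should return the tokens xampoo, nivea and men, which is what gets indexed into the mapped sub-field defined next.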
MultiField creation
POST my_index/_mapping/my_type
{
  "properties": {
    "field_name": {
      "type": "text",
      "analyzer": "standard",
      "fields": {
        "mapped": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
The above mapping creates two versions of field_name: one analyzed with the standard analyzer and another analyzed with the custom analyzer created above.
In order to query both, you can use a bool query with should clauses on both versions.
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "field_name": "xampoo"
          }
        },
        {
          "match": {
            "field_name.mapped": "shampoo"
          }
        }
      ]
    }
  }
}
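Alternatively (this is just a sketch, not part of the original answer), a multi_match query lets you search the same user input across both versions of the field in a single clause:
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "xampoo",
      "fields": ["field_name", "field_name.mapped"]
    }
  }
}
Both fields are scored together, so a correctly spelled or misspelled term can match either version.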
Hope this helps you!!

Related

Elasticsearch became case-sensitive after adding a synonym analyzer

After I added a synonym analyzer to my_index, the index became case-sensitive.
I have one property called nationality that uses the synonym analyzer, and it seems that this property became case-sensitive because of it.
Here is my /my_index/_mappings
{
  "my_index": {
    "mappings": {
      "items": {
        "properties": {
          ...
          "nationality": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "synonym"
          },
          ...
        }
      }
    }
  }
}
Inside the index, I have the phrase India COUNTRY. When I search for India nation using the command below, I get the result.
POST /my_index/_search
{
  "query": {
    "match": {
      "nationality": "India nation"
    }
  }
}
But when I search for india (note that the letter i is lowercase), I get nothing.
My assumption is that this happened because I put an uppercase filter before the synonym filter. I did this because the synonyms are uppercased, so the query India becomes INDIA after passing through this filter.
Here is my /my_index/_settings
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "my_index",
        "similarity": {
          "default": {
            "type": "BM25",
            "b": "0.9",
            "k1": "1.8"
          }
        },
        "creation_date": "1647924292297",
        "analysis": {
          "filter": {
            "synonym": {
              "type": "synonym",
              "lenient": "true",
              "synonyms": [
                "NATION, COUNTRY, FLAG"
              ]
            }
          },
          "analyzer": {
            "synonym": {
              "filter": [
                "uppercase",
                "synonym"
              ],
              "tokenizer": "whitespace"
            }
          }
        },
        "number_of_replicas": "1",
        "version": {
          "created": "6080099"
        }
      }
    }
  }
}
Is there a way to keep this property case-insensitive? All the solutions I've found say that I should set all the text inside nationality to either lowercase or uppercase. But what if I have both uppercase and lowercase letters inside the index?
Did you apply the synonym filter after adding your data to the index?
If so, the phrase "India COUNTRY" was probably indexed exactly as "India COUNTRY". When you send a match query, the query text is analyzed and becomes "INDIA COUNTRY" because of the uppercase filter. It still matches, because with a match query it is enough for one of the words to match, and "COUNTRY" provides that.
But when you send the one-word query "india", it is analyzed and converted to "INDIA" by your uppercase filter, and there is no matching term in your index. You just have a document containing "India COUNTRY".
My answer relies on a bit of assumption, but I hope it helps you understand your problem.
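To confirm this, you can run the one-word query through the analyzer with the _analyze API (a sketch based on the settings above):
GET my_index/_analyze
{
  "analyzer": "synonym",
  "text": "india"
}
It returns the single token INDIA, which has no match in an index whose documents were analyzed before the uppercase filter existed.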
I have found the solution!
I didn't realize that the filters I define in the settings are only applied when data is indexed and searched. At first, I did these steps:
1. Create the index with the synonym filter
2. Insert data
3. Add the uppercase filter before the synonym filter
By doing that, the uppercase filter was not applied to my existing data. What I should have done is:
1. Create the index with the uppercase and synonym filters (pay attention to the order)
2. Insert data
Then the filters are applied to my data.
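If re-inserting the data is not convenient, another option (not part of the original answer, just a sketch) is to create a new index that already has both filters and copy the documents over with the _reindex API; the name my_index_v2 below is just a placeholder:
POST _reindex
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_index_v2"
  }
}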

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches for the keyword value of it, but I'm having some issues:
Some records saying UK and some other saying GB
Some records saying US and some other USA
And the list goes on..
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms on my query filter?
What you are looking for is a way to have your tokens match similar tokens that may or may not share the same characters. This is only possible using synonyms.
Elasticsearch lets you configure synonyms and have your queries use them to return results accordingly.
I have configured a field with a custom analyzer that uses a synonym token filter. Below are a sample mapping and query so that you can play with them and see if they fit your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
  "name": "uk is pretty cool country"
}
And when you use the query below, it returns the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
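If you want to see the synonym expansion directly, the _analyze API shows the tokens the custom analyzer emits (a small sketch, not part of the original answer):
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "gb"
}
It should return both gb and uk at the same position, which is why the document indexed with uk matches the query for gb.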
Refer to the official documentation to learn more about this. Hope this helps!
To handle this within ES itself without using Logstash, I'd suggest using a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
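For completeness, here is a sketch of how that processor could be wired into a full ingest pipeline and applied at index time; the pipeline name normalize_country_code is just a placeholder, and the _doc endpoint assumes a recent Elasticsearch version:
PUT _ingest/pipeline/normalize_country_code
{
  "description": "Normalize alternative country codes",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    }
  ]
}

PUT my_index/_doc/1?pipeline=normalize_country_code
{
  "countryCode": "GB"
}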

Tokenize a big word into combination of words

Suppose Super Bowl is the value of a document's property in Elasticsearch. How can the term query superbowl match Super Bowl?
I read about the letter tokenizer and the word delimiter token filter, but neither seems to solve my problem. Basically, I want to be able to convert one large word into a meaningful combination of words.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-letter-tokenizer.html
I know this is quite late, but you could use a synonym filter.
You could define that super bowl is the same as "s bowl", "SuperBowl", etc.
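A rough sketch of what that could look like (the index name, analyzer name and synonym rule below are just examples, not a definitive setup):
PUT superbowl_index
{
  "settings": {
    "analysis": {
      "filter": {
        "superbowl_synonyms": {
          "type": "synonym",
          "synonyms": [
            "superbowl, super bowl => superbowl"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "superbowl_synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
With this, both superbowl and super bowl are reduced to the same token at index and search time, so a match query for superbowl will find documents containing Super Bowl.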
There are ways to do this without changing what you actually index. For example, if you are using at least 5.2 (where normalizers were introduced; it can also be done on earlier versions, but 5.x makes it easier), you can define a normalizer that lowercases your text without otherwise changing it, and then use a fuzzy query at search time to account for the space between super and bowl. My solution, though, is specific to the example you have given. As is often the case with Elasticsearch, you need to think about what kind of data goes into Elasticsearch and what is required at search time.
In any case, if you are interested in this approach, here it is:
DELETE test
PUT /test
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }
}
POST test/test/1
{"title":"Super Bowl"}
GET /test/_search
{
  "query": {
    "fuzzy": {
      "title.keyword": "superbowl"
    }
  }
}

Elasticsearch completion - generating input list with analyzers

I've had a look at this article: https://www.elastic.co/blog/you-complete-me
However, it requires writing some logic in the client to create multiple "input". Is there a way to define an analyzer (maybe using shingle or ngram/edge-ngram) that will generate the multiple terms for input?
Here's what I tried (and it obviously doesn't work):
DELETE /products/
PUT /products/
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "shingle",
          "max_shingle_size": 5,
          "min_shingle_size": 2
        }
      },
      "analyzer": {
        "autocomplete": {
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "name": {
          "type": "string",
          "copy_to": ["name_suggest"]
        },
        "name_suggest": {
          "type": "completion",
          "payloads": false,
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
PUT /products/product/1
{
  "name": "Apple iPhone 5"
}
PUT /products/product/2
{
  "name": "iPhone 4 16GB"
}
PUT /products/product/3
{
  "name": "iPhone 3 GS 16GB black"
}
PUT /products/product/4
{
  "name": "Apple iPhone 4 S 16 GB white"
}
PUT /products/product/5
{
  "name": "Apple iPhone case"
}
POST /products/_suggest
{
  "suggestions": {
    "text": "i",
    "completion": {
      "field": "name_suggest"
    }
  }
}
I don't think there's a direct way to achieve this.
I'm not sure why you would need to store ngrammed tokens, considering Elasticsearch already stores the 'input' text in an FST structure. Newer releases also allow for fuzziness in the suggest query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html#fuzzy
I can understand the need for something like a shingle analyzer to generate the inputs for you, but there doesn't seem to be a way yet. Having said that, the _analyze endpoint can be used to generate tokens from the analyzer of your choice, and those tokens can be passed to the 'input' field (with or without any other added logic). This way you won't have to replicate your analyzer logic in your application code. That's the only way I can think of to achieve the desired input field.
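As a rough sketch of that idea (assuming the autocomplete analyzer and products index from the question; depending on the Elasticsearch version, _analyze may need query-string parameters instead of a JSON body):
POST /products/_analyze
{
  "analyzer": "autocomplete",
  "text": "Apple iPhone 5"
}

PUT /products/product/1
{
  "name": "Apple iPhone 5",
  "name_suggest": {
    "input": ["apple", "iphone", "5", "apple iphone", "iphone 5", "apple iphone 5"]
  }
}
The input list above is roughly what the shingle filter produces for the name, so the suggester can complete on any of those phrases.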
Hope it helps.

Semi-exact (complete) match in ElasticSearch

Is there a way to require a complete (though not necessarily exact) match in ElasticSearch?
For instance, if a field has the term "I am a little teapot short and stout", I would like to match on " i am a LITTLE TeaPot short and stout! " but not just "teapot short and stout". I've tried the term filter, but that requires an actual exact match.
If your "not necessarily exact" definition refers to the combination of uppercase/lowercase letters and punctuation marks (like the ! in your example), this would be a solution, though not a very simple or obvious one:
The mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "trim",
            "my_pattern_replace"
          ]
        }
      },
      "filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "!",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_keyword_lowercase"
        }
      }
    }
  }
}
The idea here is the following:
- use a keyword tokenizer to keep the text as is and not split it into tokens
- use the lowercase filter to get rid of the mixed uppercase/lowercase characters
- use the trim filter to get rid of trailing and leading whitespace
- use a pattern_replace filter to get rid of the punctuation. This is needed because the keyword tokenizer won't do anything to the characters inside the text. The standard analyzer would handle the punctuation, but it would also split the text, whereas you need it as is
And this is the query you would use for the mapping above:
{
  "query": {
    "match": {
      "text": " i am a LITTLE TeaPot short and stout! "
    }
  }
}
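To see why this matches, you can run the query text through the analyzer with the _analyze API (a sketch; my_index is a placeholder for whatever index you create with the settings above, and on older versions the analyzer and text may need to be passed as query-string parameters):
GET my_index/_analyze
{
  "analyzer": "my_keyword_lowercase",
  "text": " i am a LITTLE TeaPot short and stout! "
}
Both the indexed value and the query text are reduced to the single token i am a little teapot short and stout, so the match succeeds, while "teapot short and stout" produces a different token and does not.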
