Elasticsearch: using fuzzy search to find abbreviations

I have indexed textual articles that mention company names, like apple and lemonade, and I am trying to search for these companies using their abbreviations, like APPL and LMND. Fuzzy search is giving other results: for example, searching with LMND returns land, which is mentioned in the text, but it doesn't return lemonade, whatever parameters I tried.
First question
Is fuzzy search a suitable solution for this kind of search?
Second question
What would be good parameter value ranges for my problem?
UPDATE
I have tried a synonym filter:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonyms_filter": {
            "type": "synonym",
            "synonyms": [
              "apple,APPL",
              "lemonade,LMND"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_filter"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "transcript_data": {
        "properties": {
          "words": {
            "type": "nested",
            "properties": {
              "word": {
                "type": "text",
                "search_analyzer": "synonym_analyzer"
              }
            }
          }
        }
      }
    }
  }
}
and for the search I used:
{
  "_source": false,
  "query": {
    "nested": {
      "path": "transcript_data.words",
      "query": {
        "match": {
          "transcript_data.words.word": "lmnd"
        }
      }
    }
  }
}
but it's not working.

I believe the best option for you is synonyms; they do exactly what you need.
I'll leave an example and a link to an article explaining some details.
PUT teste
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonyms_filter": {
            "type": "synonym",
            "synonyms": [
              "apple,APPL",
              "lemonade,LMND"
            ]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "synonyms_filter"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "transcript_data": {
        "properties": {
          "words": {
            "type": "nested",
            "properties": {
              "word": {
                "type": "text",
                "analyzer": "synonym_analyzer"
              }
            }
          }
        }
      }
    }
  }
}
POST teste/_bulk
{"index":{}}
{"transcript_data": {"words":{"word":"apple"}}}
GET teste/_search
{
  "query": {
    "nested": {
      "path": "transcript_data.words",
      "query": {
        "match": {
          "transcript_data.words.word": "appl"
        }
      }
    }
  }
}
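Two things worth noting. First, unlike the mapping in the question, this one sets "analyzer" rather than only "search_analyzer", so the synonym expansion is applied at index time as well. You can confirm the filter is active with the _analyze API (a quick sanity check, not part of the original answer):
POST teste/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "APPL"
}
This should return both appl and apple as tokens at the same position, which is what makes the match query above work. Second, regarding the first question: fuzziness is capped at an edit distance of 2, and turning lmnd into lemonade takes at least four edits, so lemonade can never be a fuzzy match for LMND, while land (one edit away) can. That's why synonyms, not fuzziness, are the right tool here.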

Related

ElasticSearch Search-as-you-type field type field with partial search

I recently updated my ngram implementation settings to use the Search-as-you-type field type.
https://www.elastic.co/guide/en/elasticsearch/reference/7.x/search-as-you-type.html
This worked great, but I noticed that partial searching does not work.
If I search for the number 00060434 I get the desired result, but I would also like to be able to search for 60434 and have it return document 3.
Is there a way to do this with the Search-as-you-type field type, or can I only do this with ngrams?
PUT searchasyoutype_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "englishAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "trim",
            "ascii_folding"
          ]
        }
      },
      "filter": {
        "ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "number": {
        "type": "search_as_you_type",
        "analyzer": "englishAnalyzer"
      },
      "fullName": {
        "type": "search_as_you_type",
        "analyzer": "englishAnalyzer"
      }
    }
  }
}
PUT searchasyoutype_example/_doc/1
{
  "number": "00069794",
  "fullName": "Employee 1"
}
PUT searchasyoutype_example/_doc/2
{
  "number": "00059840",
  "fullName": "Employee 2"
}
PUT searchasyoutype_example/_doc/3
{
  "number": "00060434",
  "fullName": "Employee 3"
}
GET searchasyoutype_example/_search
{
  "query": {
    "multi_match": {
      "query": "00060434",
      "type": "bool_prefix",
      "fields": [
        "number",
        "number._index_prefix",
        "fullName",
        "fullName._index_prefix"
      ]
    }
  }
}
I think you need to query on number, number._2gram and number._3gram, like below:
GET searchasyoutype_example/_search
{
  "query": {
    "multi_match": {
      "query": "00060434",
      "type": "bool_prefix",
      "fields": [
        "number",
        "number._2gram",
        "number._3gram"
      ]
    }
  }
}
search_as_you_type creates three subfields (_2gram, _3gram and _index_prefix). You can read more about how it works in this article:
https://ashish.one/blogs/search-as-you-type/
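Note that the _2gram and _3gram subfields are shingles of whole tokens, not character n-grams, so for a single-token value like 00060434 they may still not give true infix matching. If you need to find 60434 inside 00060434, a character ngram field is the usual fallback. A minimal sketch, with index, analyzer and gram sizes of my own choosing:
PUT ngram_example
{
  "settings": {
    "index.max_ngram_diff": 5,
    "analysis": {
      "tokenizer": {
        "digit_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 8,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "digit_ngrams",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "number": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
With this, a plain match query for 60434 should hit document 3, because the 5-gram 60434 was indexed. The trade-off is a larger index and noisier partial matches than search_as_you_type gives you.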

Elastic synonyms are taking over other words

With this sequence of commands:
Create the index:
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "GermanCompoundWordsAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_compound_synonym",
            "german_normalization"
          ]
        }
      },
      "filter": {
        "german_compound_synonym": {
          "type": "synonym",
          "synonyms": [
            "teppichläufer, auslegware läufer"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "GermanCompoundWordsAnalyzer"
        }
      }
    }
  }
}
Adding a few documents:
POST test_index/_doc/
{
  "sku": "kimchy",
  "name": "teppichläufer alfa"
}
POST test_index/_doc/
{
  "sku": "kimchy",
  "name": "teppichläufer beta"
}
Searching, I would expect one document, but two are returned :(
GET /test_index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "teppichläufer beta",
        "operator": "and"
      }
    }
  }
}
I get both documents, since the synonym rule teppichläufer, auslegware läufer puts läufer at position 1, where it 'substitutes' for beta. If I remove "analyzer": "GermanCompoundWordsAnalyzer", I get just one document, as expected.
How do I use these synonyms without running into this issue?
POST /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "default_field": "name",
            "query": "teppichläufer beta",
            "default_operator": "AND"
          }
        }
      ]
    }
  }
}
After a little more searching I found it in the documentation. This is an RTFM problem, sorry guys.
I fixed it with the synonym_graph token filter:
https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-synonym-graph-tokenfilter.html
The funny part is that it makes the NDCG of the results worse :)
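For reference, a minimal sketch of that fix as I understand it: swap the synonym filter type for synonym_graph and apply it only at search time, as the docs recommend for multi-word synonyms (the analyzer and filter names here are mine):
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "GermanIndexAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_normalization"
          ]
        },
        "GermanSearchAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_compound_synonym_graph",
            "german_normalization"
          ]
        }
      },
      "filter": {
        "german_compound_synonym_graph": {
          "type": "synonym_graph",
          "synonyms": [
            "teppichläufer, auslegware läufer"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "GermanIndexAnalyzer",
          "search_analyzer": "GermanSearchAnalyzer"
        }
      }
    }
  }
}
With the graph filter the multi-word synonym keeps its token positions intact, so "teppichläufer beta" with the and operator should no longer match the alfa document.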

Custom analyzer, use case : zip-code [ElasticSearch]

Let there be an index/type named customers/customer.
Each document of this set has a zip-code property.
Basically, a zip-code can look like:
String-String (ex: 8907-1009)
String String (ex: 211-20)
String (ex: 30200)
I'd like to set up my index analyzer so that as many documents as possible match. Currently, I work like this:
PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "index": "not_analyzed"
        },
        ... some string properties ...
      }
    }
  }
}
When I search for a document I use this request:
GET /customers/customer/_search
{
  "query": {
    "prefix": {
      "zip-code": "211-20"
    }
  }
}
That works if you search with the exact value. But for instance if the zip-code is "200 30", then searching with "200-30" will not give any results.
I'd like to configure my index analyzer so that I don't have this problem.
Can someone help me?
Thanks.
P.S. If you want more information, please let me know ;)
As soon as you want to find variations you don't want to use not_analyzed.
Let's try this with a different mapping:
PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "standard",
          "filter": []
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}
We're using the standard tokenizer; strings will be broken up into tokens at whitespace and punctuation marks (including dashes). You can see the actual tokens if you run the following query:
POST zip/_analyze
{
  "analyzer": "zip_code",
  "text": ["8907-1009", "211-20", "30200"]
}
Add your examples:
POST zip/_doc
{
  "zip": "8907-1009"
}
POST zip/_doc
{
  "zip": "211-20"
}
POST zip/_doc
{
  "zip": "30200"
}
Now the query seems to work fine:
GET zip/_search
{
  "query": {
    "match": {
      "zip": "211-20"
    }
  }
}
This will also work if you just search for "211". However, this might be too lenient, since it will also find "20", "20-211", "211-10",...
What you probably want is a phrase search where all the tokens in your query need to be in the field and also in the right order:
GET zip/_search
{
  "query": {
    "match_phrase": {
      "zip": "211-20"
    }
  }
}
Addition:
If the ZIP codes have a hierarchical meaning (if you have "211-20" you want this to be found when searching for "211", but not when searching for "20"), you can use the path_hierarchy tokenizer.
So changing the mapping to this:
PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "zip_tokenizer",
          "filter": []
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}
Using the same 3 documents from above you can use the match query now:
GET zip/_search
{
  "query": {
    "match": {
      "zip": "1009"
    }
  }
}
"1009" won't find anything, but "8907" or "8907-1009" will.
If you want to also find "1009", but with a lower score, you'll have to analyze the zip code with both variations I have shown (combine the 2 versions of the mapping):
PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_hierarchical": {
          "tokenizer": "zip_tokenizer",
          "filter": []
        },
        "zip_standard": {
          "tokenizer": "standard",
          "filter": []
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_standard",
          "fields": {
            "hierarchical": {
              "type": "text",
              "analyzer": "zip_hierarchical"
            }
          }
        }
      }
    }
  }
}
Add a document with the inverse order to properly test it:
POST zip/_doc
{
  "zip": "1009-111"
}
Then search both fields, but boost the one with the hierarchical tokenizer by 3:
GET zip/_search
{
  "query": {
    "multi_match": {
      "query": "1009",
      "fields": ["zip", "zip.hierarchical^3"]
    }
  }
}
Then you can see that "1009-111" has a much higher score than "8907-1009".

How to use nested mapping in language analyzer

I am presently working with language analyzers in Elasticsearch. I found that if we want an analyzer to be used when searching documents, we need to define it in the mapping.
In my case this works fine when the document contains a normal text field, but when I apply the same property to a nested field, the analyzer does not work.
This is the code for the language analyzer:
PUT checkmap
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stemmerenglish": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  },
  "mappings": {
    "dd": {
      "properties": {
        "Courses": {
          "type": "nested",
          "properties": {
            "Sname": {
              "type": "text",
              "analyzer": "stemmerenglish",
              "search_analyzer": "stemmerenglish"
            }
          }
        }
      }
    }
  }
}
Please help me out with the above problem.
You have to use a nested query for the nested type. Use the following query:
GET checkmap/_search
{
  "query": {
    "nested": {
      "path": "Courses",
      "query": {
        "match": {
          "Courses.Sname": {
            "query": "Jump"
          }
        }
      }
    }
  }
}
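To see the stemmer doing its part, you could first index a document like this (the content is my own example, using the pre-7.x index/type URL that matches the mapping above):
POST checkmap/dd
{
  "Courses": [
    {
      "Sname": "jumping"
    }
  ]
}
The nested query for Jump should match it, because both jumping and Jump are reduced to jump by the stemmer at index and search time.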
Read more here

Search irrespective vietnamese character vs english character

I want search results to be found irrespective of whether the query is typed with or without Vietnamese diacritics.
For example: I want to find the words "rồng phượng", and whether I type "rong", "rong phuong", "phuong", "rồng phuong" or "rong phượng", I should get the right results.
I think you need the icu_folding token filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
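Note that icu_tokenizer and icu_folding come from the analysis-icu plugin, which must be installed on every node before the index is created. Once it is, you can verify the folding with _analyze:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "rồng phượng"
}
This should return the folded tokens rong and phuong, so marked and unmarked input are indexed and searched identically.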
And then use a simple match query:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": "phượng"
    }
  }
}
