Elasticsearch multiple word synonyms not working

I am new to Elasticsearch and I am trying to configure synonyms, but they are not working as expected.
I have the following data in my fields:
1) Technical Lead, Module Lead, Software Engineer, Senior Software Engineer
If I search for "tl", it should return only "Technical Lead" (or "tl").
However, it is returning both "Technical Lead" and "Module Lead", because "lead" is tokenized at index time.
Could you please help me resolve this issue with the exact settings?
I have read about index-time and search-time tokenization, but I am unable to understand it.
synonyms.txt:
tl,TL => Technical Lead
se,SE => Software Engineer
sse => Senior Software Engineer
Mapping file:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms.txt"
          }
        }
      }
    }
  },
  "mappings": {
    "tweet": {
      "properties": {
        "Domain": {
          "type": "string",
          "analyzer": "synonym"
        },
        "Designation": {
          "analyzer": "synonym",
          "type": "string"
        },
        "City": {
          "type": "string",
          "analyzer": "synonym"
        }
      }
    }
  }
}

Your tokens are identical here, so you have that part down. What you need to do is ensure that you are doing an "and" match instead of an "or", since the query appears to be matching on any word rather than on all of them.
Check out your tokens:
localhost:9200/test/_analyze?analyzer=synonym&text=technical lead
localhost:9200/test/_analyze?analyzer=synonym&text=tl
And the query
{
  "query": {
    "match": {
      "Domain": {
        "query": "tl",
        "operator": "and"
      }
    }
  }
}
Usually you want your search and index analyzers to be the same, although there are advanced cases where this is not preferable. With synonyms in particular, you often do not want to apply them at both index and search time when you have expansions turned on, i.e.
tl,technical lead
However, since you are using the => (contraction) type of synonyms, this really doesn't matter here, because every term on the left is replaced by the term on the right rather than being expanded into a token for every word between the commas.
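For completeness, here is a minimal sketch of the expansion-style setup hinted at above, not a drop-in fix: it assumes you switched synonyms.txt to the expansion form (tl,technical lead) and kept the synonym analyzer from your settings, applying it only at search time so the index itself stays free of synonym tokens:
"Designation": {
  "type": "string",
  "analyzer": "standard",
  "search_analyzer": "synonym"
}
With the => contraction form you already use, a single analyzer applied at both index and search time is fine, since both sides collapse to the same tokens.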

Related

Elasticsearch autocomplete on large text field

I am using ES 7.x and the basic requirement is to provide autosuggest / type-ahead on a large text field that holds file/document content.
I have explored multiple approaches, and all of them return the entire source document (or a specific field if I restrict it using _source). I have tried the edge n-gram and n-gram tokenizers, the prefix query, and the completion suggester.
Below is a sample document (the content field might have thousands of sentences):
{
  "content": "Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic). Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana), the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch."
}
Below is the expected output:
Search query: el
Output: ["elasticsearch","elastic","elk"]
Search Query: analytics e
Output: ["analytics engine"]
Currently I am not able to achieve the above output with OOTB functionality. So I have used the highlighting functionality of Elasticsearch, applied a regex to the result, and built a unique list of suggestions in Java.
Below is my current implementation using the highlight functionality.
Index Mapping:
PUT autosuggest
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "ngram_filter": {
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ],
          "min_gram": "1",
          "type": "edge_ngram",
          "max_gram": "12"
        }
      },
      "analyzer": {
        "text_english": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [
            "lowercase",
            "stop_filter"
          ]
        },
        "whitespace_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        },
        "ngram_analyzer": {
          "filter": [
            "lowercase",
            "stop_filter",
            "ngram_filter"
          ],
          "type": "custom",
          "tokenizer": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "autocorrect": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "whitespace_analyzer"
          }
        },
        "analyzer": "text_english"
      }
    }
  }
}
Below is the Elasticsearch query which is executed from Java:
POST autosuggest/_search
{
  "_source": "content.autocorrect",
  "query": {
    "match_phrase": {
      "content.autocorrect": "analytics e"
    }
  },
  "highlight": {
    "fields": {
      "content.autocorrect": {
        "fragment_size": 500,
        "number_of_fragments": 1
      }
    }
  }
}
We have applied a regex pattern to the above query result.
Please let me know if there is any way to achieve this without the above workaround.
The completion suggester is not the right tool for your use case. Likewise, n-grams and edge n-grams might be overkill depending on how much content you have.
Have you tried the match_phrase_prefix query, which matches full tokens next to one another and treats the last one as a prefix?
The query below is very simple and should work the way you expect.
POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "analytics e"
    }
  }
}
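One knob worth knowing: match_phrase_prefix caps how many terms the trailing prefix is expanded into via max_expansions (default 50). A sketch using the long form of the query with an illustrative value:
POST test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "analytics e",
        "max_expansions": 10
      }
    }
  }
}
Note that this query still returns whole documents, so extracting short suggestion strings (as you currently do with highlighting and a regex) remains client-side work.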

Custom stopword analyzer is not working properly

I have created an index with a custom analyzer for stop words. I want Elasticsearch to ignore these words at search time. I then added one document to the index.
But when I query in Kibana for the keyword "the" with the query below, it shows a match, although it should not, because in my_analyzer I have put "the" in the my_stop_word section. I have read that if you specify an analyzer for a field in the mapping at index time, that analyzer is also used by default at query time.
Please help!
PUT /pandey
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_stemmer",
            "english_stop",
            "my_stop_word",
            "lowercase"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "my_stop_word": {
          "type": "stop",
          "stopwords": ["robot", "love", "affection", "play", "the"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "dialog": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
PUT pandey/_doc/1
{
  "dailog": "the boy is a robot. he is in love. i play cricket"
}
GET pandey/_search
{
  "query": {
    "match": {
      "dailog": "the"
    }
  }
}
A small spelling mistake can lead to this.
You defined the mapping for dialog but added the document with the field name dailog. The dynamic field mapping behavior of Elasticsearch will index it without error (this can be disabled, though).
So the query on "dailog": "the" matches via the default standard analyzer, which does not remove stop words.
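If you want this kind of typo to fail loudly instead of being indexed dynamically, a minimal sketch is to set dynamic mapping to strict; the analysis settings from the question stay unchanged and only the mappings section changes:
"mappings": {
  "dynamic": "strict",
  "properties": {
    "dialog": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
With strict dynamic mapping, indexing a document that contains an unmapped field such as dailog is rejected instead of silently creating a new field.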

Right way to search users by partial username or name using the ngram tokenizer in Elasticsearch

I want to create a search feature for a social networking application such that users can search other users by username or name, even by entering only part of the username or name, using Elasticsearch.
For example:
input: okma
result: {"username": "alokmahor", "name": "Alok Singh Mahor"} // partial match in username
input: m90
result: {"username": "ram9012", "name": "Ram Singh"} // partial match in username
input: shn
result: {"username": "r2020", "name": "Krishna Kumar"} // partial match with name
After reading and playing with the links below, I came up with a partial solution, which I am not sure is the correct way.
I followed
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
How to search for a part of a word with ElasticSearch
My solution is
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username": { "type": "text", "analyzer": "my_analyzer" },
      "name": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
PUT /my_index/_doc/1
{
  "username": "alokmahor",
  "name": "Alok Singh Mahor"
}
PUT /my_index/_doc/2
{
  "username": "ram9012",
  "name": "Ram Singh"
}
PUT /my_index/_doc/3
{
  "username": "r2020",
  "name": "Krishna Kumar"
}
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "shn",
      "analyzer": "my_analyzer",
      "fields": ["username", "name"]
    }
  }
}
Somehow this solution is partially working, but I am not sure if it is really the correct way, as I got it by playing around with Elasticsearch features and copy-pasting example code. So please suggest the correct way or improvements to it.
Things which are not working:
// "sin" does not match "Singh", but "Sin" matches and works.
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "sin",
      "analyzer": "my_analyzer",
      "fields": ["username", "name"]
    }
  }
}
So please suggest correct way
The degree of correctness can only be defined by your requirement. You can keep on refining by checking all the possible use cases one by one.
improvement on this
For the problem you mention, where Sin matches while sin does not: this is because the analyzer as defined does not make the search case-insensitive. To fix it, add a lowercase filter to your analyzer definition as below:
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [
"lowercase"
]
}
}
This answer can help you understand more about case-insensitive search.
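As a quick check (assuming the index and analyzer names from the question), _analyze should now emit lower-cased trigrams, so sin lines up with a token produced from Singh:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Singh"
}
The expected tokens are sin, ing, and ngh. Remember that the index has to be recreated and the documents reindexed, since an analyzer change only affects newly indexed data.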

elasticsearch synonyms analyzer gives 0 results

I am using elasticsearch 7.0.0.
I am trying to work with synonyms, using this configuration when creating the index:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms.txt"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "address.state": {
        "type": "text",
        "analyzer": "synonym"
      },
      "location": {
        "type": "geo_point"
      }
    }
  }
}
Here's a document inserted into the index:
{
  "name": "Berry's Burritos",
  "description": "Best burritos in New York",
  "address": {
    "street": "230 W 4th St",
    "city": "New York",
    "state": "NY",
    "zip": "10014"
  },
  "location": [
    40.7543385,
    -73.976313
  ],
  "tags": [
    "mexican",
    "tacos",
    "burritos"
  ],
  "rating": "4.3"
}
Also, here is the content of synonyms.txt:
ny, new york, big apple
When I try searching for anything in the address.state property, I get an empty result.
Here's the query:
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "rating": {
            "gte": 4
          }
        }
      },
      "must": {
        "match": {
          "address.state": "ny"
        }
      }
    }
  }
}
Even with ny as-is (no synonym) in the query, the result is empty.
Previously, when I created the index without mappings, the query returned results, except for synonyms.
But now with mappings, the result is empty even though the term is present.
This query is working though:
{
  "query": {
    "query_string": {
      "query": "tacos",
      "fields": [
        "tags"
      ]
    }
  }
}
I have looked into and researched many articles/tutorials and got this far.
What am I missing here now?
While indexing you are passing the value as "state": "NY". Notice the case of NY. The synonym analyzer defined in the settings has only one filter, synonym. NY doesn't match any set of synonyms defined in synonyms.txt because of the case: NY is not equal to ny. To overcome this (in other words, to make it case-insensitive), add a lowercase filter before the synonym filter in the synonym analyzer. This ensures that any input text is lower-cased first and the synonym filter is applied afterwards. The same happens when you search on that field using full-text search queries.
So your settings will be as below:
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
}
No changes are required in the mapping.
Why did it initially work?
When you haven't defined any mapping, Elasticsearch dynamically maps address.state as a text field with no explicit analyzer. In that case it uses the standard analyzer by default, which includes the lowercase token filter, and hence the query matched the document.
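To verify the fix (a sketch; restaurants is an assumed index name, since the question does not name the index), run _analyze with the updated synonym analyzer and confirm that NY is now expanded the same way as ny:
POST restaurants/_analyze
{
  "analyzer": "synonym",
  "text": "NY"
}
With the equivalence line ny, new york, big apple and the default expand behavior, the output should contain ny plus the expanded terms new, york, big and apple, so the match on address.state finds the document for either NY or ny once it is reindexed with the new settings.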

Why does the match_phrase_prefix query return wrong results with different lengths of the phrase?

I have a very simple query:
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": "grab"
    }
  }
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and the index definition (I use the Stempel (Polish) Analysis plugin for Elasticsearch):
POST /indexX
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          },
          "polish_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          },
          "polish_my_stem": {
            "type": "stemmer",
            "rules_path": "analysis/stems.txt"
          }
        },
        "analyzer": {
          "polish_with_synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "lowercase",
              "polish_stop",
              "polish_stem",
              "polish_my_stem"
            ]
          }
        }
      }
    }
  }
}
For this query I get zero results. When I change the phrase to GRA or GRABA, it returns 1 result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values even as high as 1200 and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
Without going into details on how to resolve this, please consider getting rid of the polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I have seen different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess that in the case of Polish names some kind of prefix query on a non-analyzed field would suffice...
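A minimal sketch of that idea, kept consistent with the old string-based mapping in the question; raw is a hypothetical sub-field name:
"surname": {
  "type": "string",
  "analyzer": "polish",
  "copy_to": [
    "full_name"
  ],
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
and a prefix query against it:
POST /indexX/document/_search
{
  "query": {
    "prefix": {
      "surname.raw": "GRAB"
    }
  }
}
Keep in mind that a prefix query is not analyzed, so the case must match what was stored; for case-insensitive prefixes, the usual alternative is a custom analyzer with a keyword tokenizer plus a lowercase filter on that sub-field.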
