Searching synonyms in elasticsearch - elasticsearch

I'm trying to create synonym search over languages indexed in ES.
For example,
Indexed document -> name: German
Synonyms: German, Deutsch, XYZ
What I want is that when I type German, Deutsch, or XYZ, ES returns German.
Is that possible at all?

Yes, very much so. Elasticsearch handles synonyms very well. Here is an example of how I configured synonyms on my cluster:
curl -XPOST localhost:9200/new-index -d '{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 0,
"analysis": {
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms/synonyms.txt"
}
},
"analyzer": {
"synonym": {
"tokenizer": "lowercase",
"filter": [
"synonym"
]
}
}
}
},
"mappings": {
"new-type": {
"_all": {
"enabled": false
},
"properties": {
"Title": {
"type": "multi_field",
"store": "yes",
"fields": {
"Title": {
"type": "string",
"analyzer": "synonym"
}
}
}
}
}
}
}'
The synonyms_path is resolved relative to the config folder, so Elasticsearch looks for the synonyms folder there and reads the text file from it. For your requirements, the contents of synonyms.txt would be:
German, Deutsch, XYZ
REMEMBER: if tokens are lower-cased at index time (by a lowercase filter or, as here, the lowercase tokenizer), the synonyms need to be in lower case too. Restart the nodes if it is not working.
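As a minimal sketch of the lower-casing point above: assuming the file uses the comma-separated format shown, a small helper could lower-case each entry before the file is copied to the nodes. The name normalize_synonym_line is hypothetical and not part of Elasticsearch.

```python
def normalize_synonym_line(line: str) -> str:
    """Lower-case one line of a comma-separated synonyms file.

    Hypothetical helper, not an Elasticsearch API. Blank lines and
    comment lines (starting with '#') are returned without changes
    other than trimming surrounding whitespace.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return stripped
    # Split on commas, trim each entry, lower-case it, and join again.
    entries = [entry.strip().lower() for entry in stripped.split(",")]
    return ", ".join(entry for entry in entries if entry)


print(normalize_synonym_line("German, Deutsch, XYZ"))  # german, deutsch, xyz
```

Running it over every line of synonyms.txt before deployment keeps the file consistent with the lower-cased tokens.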

Related

elasticsearch synonyms analyzer gives 0 results

I am using elasticsearch 7.0.0.
I am trying to work on synonyms with this configuration while creating index.
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
},
"mappings": {
"properties": {
"address.state": {
"type": "text",
"analyzer": "synonym"
},
"location": {
"type": "geo_point"
}
}
}
}
Here's a document inserted into the index:
{
"name": "Berry's Burritos",
"description": "Best burritos in New York",
"address": {
"street": "230 W 4th St",
"city": "New York",
"state": "NY",
"zip": "10014"
},
"location": [
40.7543385,
-73.976313
],
"tags": [
"mexican",
"tacos",
"burritos"
],
"rating": "4.3"
}
Also content in synonyms.txt:
ny, new york, big apple
When I search for anything in the address.state property, I get an empty result.
Here's the query:
{
"query": {
"bool": {
"filter": {
"range": {
"rating": {
"gte": 4
}
}
},
"must": {
"match": {
"address.state": "ny"
}
}
}
}
}
Even with ny in the query (the term as it is, no synonym), the result is empty.
Before, when I created the index without mappings, the query returned the document, except when I searched by a synonym.
But now, with mappings, the result is empty even though the term is present.
This query is working though:
{
"query": {
"query_string": {
"query": "tacos",
"fields": [
"tags"
]
}
}
}
I have read many articles and tutorials and got this far.
What am I missing here?
While indexing, you are passing the value as "state": "NY". Notice the case of NY. The synonym analyzer defined in the settings has only one filter, synonym. NY doesn't match any set of synonyms defined in synonyms.txt because of its case: NY isn't equal to ny. To make the match case insensitive, add the lowercase filter before the synonym filter in the synonym analyzer. This ensures that any input text is lower-cased first and the synonym filter is applied afterwards. The same happens when you search on that field using full-text search queries.
So your settings will be as below:
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
}
No changes are required in mapping.
Why did it initially work?
Because when you haven't defined any mapping, Elasticsearch maps address.state as a text field with no explicit analyzer. In that case Elasticsearch uses the standard analyzer by default, which includes the lowercase token filter, and hence the query matched the document.
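To make the case problem concrete, here is a toy model in Python of the analyzer chain described above. It is only a sketch, not how Elasticsearch implements the synonym filter: it compares single whitespace-separated tokens against the synonym set and does not handle multi-word synonyms in the input. The function names are made up for illustration.

```python
# One synonym set, mirroring the synonyms.txt from the question.
SYNONYM_SETS = [["ny", "new york", "big apple"]]


def analyze(text, use_lowercase):
    """Toy chain: whitespace tokenizer, optional lowercase, synonym expansion."""
    tokens = text.split()
    if use_lowercase:
        tokens = [token.lower() for token in tokens]
    terms = set()
    for token in tokens:
        terms.add(token)
        for group in SYNONYM_SETS:
            # The comparison is case sensitive, so "NY" is not found in the set.
            if token in group:
                terms.update(group)
    return terms


def matches(indexed_text, query_text, use_lowercase):
    """True if the indexed terms and the query terms share at least one term."""
    return bool(analyze(indexed_text, use_lowercase) & analyze(query_text, use_lowercase))


print(matches("NY", "ny", use_lowercase=False))  # False
print(matches("NY", "ny", use_lowercase=True))   # True
```

Without the lowercase step the index holds only NY while the query expands ny into lower-case terms, so nothing overlaps.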

Elasticsearch Synonym search analyzer not updating after update synonyms.txt?

So I have an index with the synonym mapping defined in the search analyzer. When I first created the index, the synonyms were picked up on search. After that, I updated the synonyms.txt file on the nodes once to change a synonym mapping, and restarted each node after making the change. This caused the synonym change to be reflected on search throughout the index.
Now, when I change the synonyms file and restart the nodes, the synonym mapping isn't updating as I believe it should. Am I missing something? I thought since the synonym mapping was on a search_analyzer I wouldn't have to reindex each time to reflect the changes.
Here is my index definition:
PUT /synonym_index
{
"aliases": {},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english",
"search_analyzer":"english_and_synonyms"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
},
"english_and_synonyms": {
"tokenizer": "standard",
"filter": [
"search_synonyms",
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"search_synonyms" : {
"type" : "synonym_graph",
"synonyms_path" : "analysis/synonyms.txt"
}
}
},
"index": {
"number_of_shards": "5",
"number_of_replicas": "1"
}
}
}
I've tried restarting the node with
sudo service elasticsearch restart
and also with
sudo service elasticsearch stop
sudo service elasticsearch start
but neither causes my changes to be reflected. Do I need to reindex every time I update the synonyms file even though it's a search analyzer?
To reflect the change in the synonyms file, you need to close and reopen the index after making the changes to the file. This can be done with two POST requests:
POST /synonym_index/_close
POST /synonym_index/_open
After the _open call, you should see the changes reflected in your searches.
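If you want to script the close/open cycle, a minimal sketch with Python's standard library could look like this. The base URL http://localhost:9200 is an assumption about your cluster, and the helper name is hypothetical. The code only builds the requests; sending them is left as a commented-out line.

```python
import urllib.request


def close_and_open_requests(base_url, index):
    """Build (but do not send) the two POST requests from the answer above."""
    return [
        urllib.request.Request(f"{base_url}/{index}/_close", method="POST"),
        urllib.request.Request(f"{base_url}/{index}/_open", method="POST"),
    ]


for request in close_and_open_requests("http://localhost:9200", "synonym_index"):
    print(request.get_method(), request.full_url)
    # To actually send the request against a running cluster:
    # urllib.request.urlopen(request)
```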
Maybe the Reload Search Analyzers API is what you are looking for:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-reload-analyzers.html
You have to declare that your synonyms are updatable:
"search_synonyms" : {
"type" : "synonym_graph",
"synonyms_path" : "analysis/synonyms.txt",
"updatable": true
}
And in your mapping you need to declare your custom search_analyzer:
"mappings": {
"properties": {
"one_attribute": {
"type": "text",
"search_analyzer": "english_and_synonyms"
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
Do I need to reindex every time I update the synonyms file even though it's a search analyzer?
Only if your synonyms are used at index time. If they are used only at search time, you don't have to reindex every time.
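A rough sketch of the updatable approach in Python, assuming the index is created with these settings: one helper sets "updatable": true on the filter, the other returns the path of the Reload Search Analyzers API linked above. Both helper names are hypothetical.

```python
import copy


def make_filter_updatable(settings, filter_name):
    """Return a copy of the index settings with updatable set on one filter."""
    updated = copy.deepcopy(settings)
    updated["analysis"]["filter"][filter_name]["updatable"] = True
    return updated


def reload_path(index):
    """Path of the Reload Search Analyzers API for one index."""
    return f"/{index}/_reload_search_analyzers"


# Filter definition taken from the question.
settings = {
    "analysis": {
        "filter": {
            "search_synonyms": {
                "type": "synonym_graph",
                "synonyms_path": "analysis/synonyms.txt",
            }
        }
    }
}

print(make_filter_updatable(settings, "search_synonyms"))
print("POST", reload_path("synonym_index"))
```

After editing synonyms.txt on the nodes, a POST to that path asks Elasticsearch to reload the search analyzers without closing the index.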

Elastic Search - how to use language analyzer with UTF-8 filter?

I have a problem with the Elasticsearch language analyzer. I am working with the Lithuanian language, so I am using the Lithuanian language analyzer. The analyzer works fine and I get all the word cases I need. For example, I index the Lithuanian city "Klaipėda":
PUT /cities/city/1
{
"name": "Klaipėda"
}
The problem is that I also need to get a result when I search for "Klaipėda" written only in the Latin alphabet ("Klaipeda"), in all Lithuanian cases:
Nominative case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" work, but "Klaipeda", "Klaipedos", "Klaipedoje" do not.
My index:
PUT /cities
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"fields": {
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"md_folded_analyzer": {
"type": "lithuanian",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
and search query:
GET /cities/_search
{
"query": {
"multi_match" : {
"type": "most_fields",
"query": "klaipeda",
"fields": [ "name", "name.folded" ]
}
}
}
What am I doing wrong? Thanks for your help.
The technique you are using here is so-called multi-fields. The limitation of the underlying name.folded field is that you can't perform search against it - you can perform only sorting by name.folded and aggregation.
To work around this, I've come up with the following set-up:
Separate fields set-up (to eliminate duplicates - just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"copy_to": "folded"
},
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}'
Change the type of your analyzer to custom, as described here, because otherwise the asciifolding filter does not make it into the configuration. More importantly, asciifolding should come after all the Lithuanian stemming and stop-word filters, because after folding a word can lose its intended meaning.
curl -XPUT http://localhost:9200/my_cities -d '
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"md_folded_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_stemmer",
"asciifolding"
]
}
}
}
}
}'
Sorry, I've left out lithuanian_keywords: it requires additional set-up, which I skipped here. But I hope you get the idea.
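To see what the folding step contributes, here is a rough stand-in for asciifolding in Python. It is an approximation under my own assumptions (Unicode NFKD decomposition followed by dropping combining marks), not the actual filter, and it does no stemming. The function name is made up.

```python
import unicodedata


def fold_to_ascii(token):
    """Approximate asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["Klaipėda", "Klaipėdos", "Klaipėdoje"]:
    lowered = word.lower()
    print(lowered, "->", fold_to_ascii(lowered))
```

With folding in the chain, the accented and unaccented spellings end up as the same term, which is what the folded field relies on.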

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an Elasticsearch index using a custom analyzer which uses the letter tokenizer and the lowercase and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't return any results. When I tried the full word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz and tried to search by sub-words again and it worked.
To try to understand what is going on, I thought I would check the terms generated for my documents using _termvector service, and the result was identical for both, the underscore-separated sub-words and the dash-separated sub-words, so really I expect the result of searching to be identical in both cases.
Any idea what I could be doing wrong?
If it helps, these are the settings I used for my index:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"cmt_value_analyzer": {
"tokenizer": "letter",
"filter": [
"lowercase",
"my_filter"
],
"type": "custom"
}
},
"filter": {
"my_filter": {
"type": "word_delimiter"
}
}
}
}
},
"mappings": {
"alertmodel": {
"properties": {
"name": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"productId": {
"type": "double"
},
"productName": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"link": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"updatedOn": {
"type": "date"
}
}
}
}
}
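For reference, a toy model of the index-time tokenization described in the question, under the assumption that a letter tokenizer splits on every non-letter character. It does not explain the search behaviour; it only illustrates why identical terms are expected for both spellings, as the term vectors show. The function name is hypothetical.

```python
import re


def letter_tokenize(text):
    """Toy letter tokenizer plus lowercase: keep runs of letters only."""
    # [^\W\d_] matches word characters that are neither digits nor underscore.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


print(letter_tokenize("abc_xyz"))  # ['abc', 'xyz']
print(letter_tokenize("abc-xyz"))  # ['abc', 'xyz']
```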

How to implement case sensitive search in elasticsearch?

I have a field in my indexed documents where I need the search to be case sensitive. I am using the match query to fetch the results.
An example of my data document is :
{
"name" : "binoy",
"age" : 26,
"country": "India"
}
Now when I give the following query:
{
"query" : {
"match" : {
"name" : "Binoy"
}
}
}
It gives me a match for "binoy" against "Binoy". I want the search to be case sensitive. By default, Elasticsearch seems to be case insensitive. How can I make the search case sensitive in Elasticsearch?
In the mapping you can define the field as not_analyzed.
curl -X PUT "http://localhost:9200/sample" -d '{
"index": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}'
echo
curl -X PUT "http://localhost:9200/sample/data/_mapping" -d '{
"data": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}'
Now you can index and search as usual. The field will not be analyzed, which makes the search case sensitive.
It depends on the mapping you have defined for your field name. If you haven't defined any mapping, then Elasticsearch will treat it as a string and use the standard analyzer (which lower-cases the tokens) to generate tokens. Your query will also use the same analyzer for search, hence matching is done by lower-casing the input. That's why "Binoy" matches "binoy".
To solve it you can define a custom analyzer without lowercase filter and use it for your field name. You can define the analyzer as below
"analyzer": {
"casesensitive_text": {
"type": "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
You can define the mapping for name as below
"name": {
"type": "string",
"analyzer": "casesensitive_text"
}
Now you can do the search on name.
Note: the analyzer above is for example purposes. You may need to change it to suit your needs.
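The effect of dropping the lowercase filter can be illustrated with a toy analyzer in Python. This is a sketch under simplifying assumptions (whitespace splitting only, no stop words or stemming), not the real standard analyzer, and the function names are made up.

```python
def toy_analyze(text, lowercase):
    """Split on whitespace and optionally lower-case each token."""
    tokens = text.split()
    return [token.lower() for token in tokens] if lowercase else tokens


def is_match(indexed, query, lowercase):
    """Both sides go through the same analyzer, as with a match query."""
    return bool(set(toy_analyze(indexed, lowercase)) & set(toy_analyze(query, lowercase)))


print(is_match("binoy", "Binoy", lowercase=True))   # True: the default behaviour
print(is_match("binoy", "Binoy", lowercase=False))  # False: case sensitive
```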
Have your mapping like:
PUT /whatever
{
"settings": {
"analysis": {
"analyzer": {
"mine": {
"type": "custom",
"tokenizer": "standard"
}
}
}
},
"mappings": {
"type": {
"properties": {
"name": {
"type": "string",
"analyzer": "mine"
}
}
}
}
}
That is, no lowercase filter for that custom analyzer.
Here is the full index template which worked for my ElasticSearch 5.6:
{
"template": "logstash-*",
"settings": {
"analysis" : {
"analyzer" : {
"case_sensitive" : {
"type" : "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
},
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"fluentd": {
"properties": {
"message": {
"type": "text",
"fields": {
"case_sensitive": {
"type": "text",
"analyzer": "case_sensitive"
}
}
}
}
}
}
}
As you can see, the logs come from FluentD and are saved into a time-based index, logstash-*. To make sure I can still execute wildcard queries on the message field, I put a multi-field mapping on that field. Wildcard/analyzed queries can be run on the message field and case-sensitive ones on the message.case_sensitive field.
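For illustration, a query against the sub-field could be built like this in Python. The search term "ERROR" and the helper name are placeholders of mine; only the field name message.case_sensitive comes from the template above.

```python
import json


def case_sensitive_match(text, field="message.case_sensitive"):
    """Build a match query body that targets the case-sensitive sub-field."""
    return {"query": {"match": {field: text}}}


# Print the body that would be sent to GET /logstash-*/_search.
print(json.dumps(case_sensitive_match("ERROR"), indent=2))
```

Queries that should stay case insensitive keep targeting the plain message field.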

Resources