I tried to debug my synonym search. It seems that when I use the wordnet format with the wn_s.pl file it doesn't work, but when I use a custom synonym.txt file it works. Please let me know where I am going wrong. Please find my index below:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "format": "wordnet",
            "synonyms_path": "analysis/wn_s.pl"
          }
        },
        "analyzer": {
          "synonym": {
            "tokenizer": "standard",
            "filter": ["lowercase", "synonym"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "firebaseId": { "type": "text" },
      "name": {
        "fielddata": true,
        "type": "text",
        "analyzer": "standard"
      },
      "name_auto": { "type": "text" },
      "category_name": {
        "type": "text",
        "analyzer": "synonym"
      },
      "sku": { "type": "text" },
      "price": { "type": "text" },
      "magento_id": { "type": "text" },
      "seller_id": { "type": "text" },
      "square_item_id": { "type": "text" },
      "square_variation_id": { "type": "text" },
      "typeId": { "type": "text" }
    }
  }
}
I am trying to do a synonym search on category_name. I have items like shoes, dress, etc. When I search for boots, flipflop, or slipper, nothing comes back.
Here is my search query:
{
"query": {
"match": {
"category_name": "flipflop"
}
}
}
Your WordNet synonym format is not correct. Please have a look here.
For a fast implementation, please look at the synonyms.json.
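For reference, the wordnet format expects WordNet prolog entries of the form s(synset_id,word_number,'word',type,sense,tag_count). — a minimal sketch, with made-up synset IDs:

s(100000001,1,'shoes',n,1,0).
s(100000001,2,'boots',n,1,0).
s(100000001,3,'slipper',n,1,0).
s(100000001,4,'flipflop',n,1,0).

Words sharing the same synset ID are treated as synonyms; if wn_s.pl does not follow this exact pattern, the filter cannot load it as WordNet synonyms.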
I am working on a project to perform multilingual full-text search using Elasticsearch.
A single field can contain a mix of languages, or transliteration; for example, an English text may contain Armenian words, or an Armenian text may contain Russian words.
I am now trying to configure text analysis with language analyzers.
How correct is my analyzer, and will it work at all?
PUT /example
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": ["օրինակ"]
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
},
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": ["пример"]
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
},
"graph_synonyms": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"rebuilt_armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"graph_synonyms"
]
}
}
}},"mappings": {
"properties": {
"age": { "type": "integer" },
"email": { "type": "keyword" },
"name": { "type": "text", "analyzer": "rebuilt_armenian" } ,
"location": {
"type": "geo_point"
}
}}}
I work with multilingual content in a different way. It seems that in your case you don't know the language before indexing. In my current scenario, I create a sub-field for each language using "fields", and give each sub-field a language-specific analyzer.
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": [
"օրինակ"
]
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
},
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": [
"пример"
]
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
},
"graph_synonyms": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"rebuilt_armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer"
]
},
"rebuilt_russian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer"
]
}
}
}
},
"mappings": {
"properties": {
"age": {
"type": "integer"
},
"email": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"ar": {
"type": "text",
"analyzer": "rebuilt_armenian"
},
"ru": {
"type": "text",
"analyzer": "rebuilt_russian"
}
}
},
"location": {
"type": "geo_point"
}
}
}
}
During indexing and during search we don't know what language the text is in.
As far as I understand, it is therefore necessary to search the specific sub-fields; if you search plain "name", the standard analyzer will be used:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [ "name.ar", "name.ru"],
"query": "phone"
}
}
],
"filter": [
{
"geo_distance": {
"distance": "25km",
"location": {
"lat": 40.79420000 ,
"lon": 43.84528000
}
}
}
]
}
}
}
You can check your analyzer with the _analyze API: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
Enter some mixed text and see whether the result is what you want.
Sometimes it is also fine to just use the standard analyzer and forget about language-specific stopword removal and stemming.
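For example, a quick check against the analyzer from the question (index and analyzer names taken from the mapping above):

GET /example/_analyze
{
  "analyzer": "rebuilt_armenian",
  "text": "пример text օրինակ"
}

The response lists the tokens the analyzer produces, so you can see directly whether the Russian and Armenian words are stopped and stemmed as expected.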
I am trying to add autocomplete based on what the user searches.
Currently, I have the following mapping:
{
"courts_2": {
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"content": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"title": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"verdict": {
"type": "text"
}
}
}
}
}
Below is the code I used for the settings:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_metaphone"
]
}
},
"filter": {
"my_metaphone": {
"type": "phonetic",
"encoder": "metaphone",
"replace": true
}
}
}
}
},
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"verdict": {
"type": "text"
},
"title": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"standard": {
"type": "text"
}
}
},
"content": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"standard": {
"type": "text"
}
}
}
}
}
}
Here's what I would like to implement:
I would like to collect and store all of the queries made to the endpoint and use autocomplete on that. For example, to date, all of the users have made the following queries -
Real Madrid v/s Barcelona
Real Madrid Team
Real Madrid Coach
Barcelona v/s Man City
Sevilla Home Ground
Man Utd. recent results
Now, if anyone searches Rea, the following autocomplete queries should be suggested:
Real Madrid v/s Barcelona
Real Madrid Team
Real Madrid Coach
This is based on the searches made by all users to date, not by a single user. Further, I would like to analyze the top queries made in, say, the past month.
I am using Elasticsearch 7.1 on the AWS Elasticsearch Service.
Edit: I have considerably digressed from the initial question as my need has evolved a bit. I apologise if this has caused any troubles.
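One possible approach, sketched here under assumptions (the search_queries index and its field names are invented, and the search_as_you_type field type requires Elasticsearch 7.2+, so on 7.1 an edge-ngram analyzer or the completion suggester would be needed instead): log every user query into a dedicated index with a timestamp, then autocomplete and aggregate over that index.

PUT /search_queries
{
  "mappings": {
    "properties": {
      "query_text": { "type": "search_as_you_type" },
      "query_raw":  { "type": "keyword" },
      "timestamp":  { "type": "date" }
    }
  }
}

Autocomplete over all collected queries as the user types Rea:

GET /search_queries/_search
{
  "query": {
    "multi_match": {
      "query": "Rea",
      "type": "bool_prefix",
      "fields": ["query_text", "query_text._2gram", "query_text._3gram"]
    }
  }
}

Top queries of the past month, via a terms aggregation on the keyword copy:

GET /search_queries/_search
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-1M/d" } } },
  "aggs": {
    "top_queries": { "terms": { "field": "query_raw", "size": 10 } }
  }
}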
I am currently using Elasticsearch's phonetic analyzer. I want the query to give a higher score to exact matches than to phonetic ones. Here is the query I am using:
{
"query": {
"multi_match" : {
"query" : "Abhijeet",
"fields" : ["content", "title"]
}
},
"size": 10,
"_source": [ "title", "bench", "court", "id_" ],
"highlight": {
"fields" : {
"title" : {},
"content":{}
}
}
}
When I search for Abhijeet, the top results are Abhijit, and Abhijeet only comes later. I want the exact matches to always appear first, followed by the phonetic ones. Can this be done?
Edit:
Mappings
{
"courts_2": {
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"content": {
"type": "text",
"analyzer": "my_analyzer"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"verdict": {
"type": "text"
}
}
}
}
}
Here is the code I used to set up the phonetic analyzer:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_metaphone"
]
}
},
"filter": {
"my_metaphone": {
"type": "phonetic",
"encoder": "metaphone",
"replace": true
}
}
}
}
},
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"content": {
"type": "text",
"analyzer": "my_analyzer"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"verdict": {
"type": "text"
}
}
}
}
Now, I want to query only the title and the content field. Here, I want the exact matches to appear first and then the phonetic ones.
The general solution approach is:
to use a bool query,
with your phonetic query/queries in the must clause,
and the non-phonetic query/queries in the should clause.
I can update the answer if you add the mappings and settings of your index to your question.
Update: Solution Approach
A. Expand your mapping to use multi-fields for title and content:
"title": {
"type": "text",
"analyzer": "my_analyzer",
"fields" : {
"standard" : {
"type" : "text"
}
}
},
...
"content": {
"type": "text",
"analyzer": "my_analyzer"
"fields" : {
"standard" : {
"type" : "text"
}
}
},
B. Get the new sub-fields populated, e.g. by re-indexing every document in place:
POST courts_2/_update_by_query
C. Adjust your query to leverage the newly introduced fields:
GET courts_2/_search
{
"_source": ["title","bench","court","id_"],
"size": 10,
"query": {
"bool": {
"must": {
"multi_match": {
"query": "Abhijeet",
"fields": ["title", "content"]
}
},
"should": {
"multi_match": {
"query": "Abhijeet",
"fields": ["title.standard", "content.standard"]
}
}
}
},
"highlight": {
"fields": {
"title": {},
"content": {}
}
}
}
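Every phonetic match is returned via the must clause on the my_analyzer fields, and documents whose exact spelling also matches the standard-analyzed .standard sub-fields collect the additional should score, so exact matches rank first.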
I have read that in previous versions of ES (< 2) the "token_analyzer" key needed to be changed to "analyzer". But no matter what I do, I am still getting this error:
"type": "mapper_parsing_exception",
"reason": "analyzer on field [email] must be set when search_analyzer is set"
Here is what I am sending to ES via a PUT request when I get the error:
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email",
"filter": ["lowercase", "stop"]
}
}
}
},
"mappings" : {
"uuser": {
"properties": {
"email": {
"type": "text",
"search_analyzer": "my_email_analyzer",
"fields": {
"email": {
"type": "text",
"analyzer": "my_email_analyzer"
}
}
},
"facebookId": {
"type": "text"
},
"name": {
"type": "text"
},
"profileImageUrl": {
"type": "text"
},
"signupDate": {
"type": "date"
},
"username": {
"type": "text"
},
"phoneNumber": {
"type": "text"
}
}
}
}
}
Any ideas what is wrong?
Because you have specified a search_analyzer for the field, you also have to specify the analyzer to be used at indexing time. For example, add this line under where you specify the search_analyzer:
"analyzer": "standard",
To give you this:
{
"settings": {
"analysis": {
"analyzer": {
"my_email_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email",
"filter": ["lowercase", "stop"]
}
}
}
},
"mappings" : {
"uuser": {
"properties": {
"email": {
"type": "text",
"search_analyzer": "my_email_analyzer",
"analyzer": "standard",
"fields": {
"email": {
"type": "text",
"analyzer": "my_email_analyzer"
}
}
},
"facebookId": {
"type": "text"
},
"name": {
"type": "text"
},
"profileImageUrl": {
"type": "text"
},
"signupDate": {
"type": "date"
},
"username": {
"type": "text"
},
"phoneNumber": {
"type": "text"
}
}
}
}
}
See also: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
I am looking for a way to make ES search the data with multiple analyzers: an NGram analyzer plus one or a few language analyzers.
A possible solution is to use multi-fields and explicitly declare which analyzer to use for each field.
For example, to set the following mappings:
"mappings": {
"my_entity": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"spanish": {
"type": "string",
"analyzer": "spanish"
},
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
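A search then has to enumerate every sub-field explicitly, for example (index name assumed):

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "ejemplo",
      "fields": ["my_field.ngram", "my_field.spanish", "my_field.english"]
    }
  }
}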
The problem with that is that I have to explicitly write every field and its analyzers into the search query.
It also does not allow searching "_all" with multiple analyzers.
Is there a way to make an "_all" query use multiple analyzers?
Something like "_all.ngram", "_all.spanish", without using copy_to to duplicate the data?
Is it possible to combine an ngram analyzer with Spanish (or any other foreign language) into a single custom analyzer?
I have tested the following settings but these did not work:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
}
},
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": ["ejemplo"]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"fields": {
"analyzer1": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"analyzer2": {
"type": "string",
"analyzer": "spanish"
},
"analyzer3": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "_all",
"text": "Hola, me llamo Juan."
}
returns just ngram results, without Spanish analysis, whereas
GET /ngrams_index/_analyze
{
"field": "my_field.analyzer2",
"text": "Hola, me llamo Juan."
}
properly analyzes the search string.
Is it possible to build a custom analyzer which combines Spanish and ngram?
There is a way to create a custom ngram+language analyzer: apply the ngram as a token filter after the language-specific filters, so that stopword removal and stemming operate on whole words before the trigrams are generated:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": [
"ejemplo"
]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer",
"ngram_filter"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "my_field",
"text": "Hola, me llamo Juan."
}
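With this setup the trigrams are generated from lowercased, stopword-filtered, stemmed tokens (the Spanish stopword "me" is dropped before ngramming, for instance), rather than from the raw character stream as with the ngram tokenizer in the original settings.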