Synonym token filter - elasticsearch

I created a test index with a synonym token filter:
PUT /synonyms-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "shares","equity","stock"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
Then I ran the _analyze API:
POST synonyms-index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "equity awesome"
}
I got the following response, to see which tokens would go into the inverted index. I was expecting "shares" and "stock" to be added as per the synonym rule, but that doesn't seem to happen. Am I missing anything here?
{
  "tokens": [
    {
      "token": "equity",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "awesome",
      "start_offset": 7,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Posting the answer for the community:
This is a common pitfall with the JSON format. The entire rule must go into a single double-quoted string, which is then treated as simple expansion:
"synonyms": [ "shares,equity,stock" ]
rather than
"synonyms": [ "shares", "equity", "stock" ]
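With the rule written this way, re-running the same _analyze request against the corrected index should return all three terms at the same position (expected output; the exact token order and type labels can vary by Elasticsearch version):
{
  "tokens": [
    { "token": "equity",  "start_offset": 0, "end_offset": 6,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "shares",  "start_offset": 0, "end_offset": 6,  "type": "SYNONYM",    "position": 0 },
    { "token": "stock",   "start_offset": 0, "end_offset": 6,  "type": "SYNONYM",    "position": 0 },
    { "token": "awesome", "start_offset": 7, "end_offset": 14, "type": "<ALPHANUM>", "position": 1 }
  ]
}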

Related

How to optimize elasticsearch's full text search to match strings like 'C++'

We have a search engine for text content which contains strings like c++ or c#. The switch to Elasticsearch has shown that the search does not match terms like 'c++'; the ++ is removed.
How can we teach Elasticsearch to match correctly in a full-text search and not remove special characters? Characters like the comma , should of course still be removed.
You need to create your own custom analyzer that generates tokens as per your requirement. For your example I created the custom analyzer below, with a text field named language, and indexed some sample docs:
Index creation with a custom analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": [
            "replace_comma"
          ]
        }
      },
      "char_filter": {
        "replace_comma": {
          "type": "mapping",
          "mappings": [
            ", => \\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "language": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Tokens generated for text like c++, c# and c,java:
POST http://{{hostname}}:{{port}}/{{index}}/_analyze
{
  "text": "c#",
  "analyzer": "my_analyzer"
}
{
  "tokens": [
    {
      "token": "c#",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
For c,java it generates two separate tokens, c and java, since the char filter replaces , with whitespace, as shown below:
{
  "text": "c, java",
  "analyzer": "my_analyzer"
}
{
  "tokens": [
    {
      "token": "c",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "java",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}
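For c++, the whitespace tokenizer leaves the special characters untouched, so you should get a single c++ token (expected output, not reproduced here):
{
  "text": "c++",
  "analyzer": "my_analyzer"
}
{
  "tokens": [
    {
      "token": "c++",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}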
Note: You need to understand the analysis process and modify your custom analyzer accordingly so it works for all of your use cases. My example might not cover all your edge cases, but I hope it gives you an idea of how to handle such requirements.

Elasticsearch custom analyser

Is it possible to create a custom Elasticsearch analyser which splits the input on a space and then creates two tokens: one with everything before the space and a second with everything?
For example: I have a stored record with a field containing the following text: '35 G'.
Now I want to retrieve that record by typing only '35' or '35 G' as the query against that field.
So Elastic should create two tokens: ['35', '35 G'] and no more.
If it's possible, how can I achieve it?
Doable using the path_hierarchy tokenizer.
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": " "
        }
      }
    }
  }
  ...
}
And now
POST test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "35 G"
}
outputs
{
  "tokens": [
    {
      "token": "35",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "35 G",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
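The mappings elided above would then point the field at this analyzer. A sketch of one way to wire it up (the field name size is purely illustrative, and the keyword search_analyzer is an assumption so that the whole query string is matched as a single token rather than being split again):
"mappings": {
  "properties": {
    "size": {
      "type": "text",
      "analyzer": "my_analyzer",
      "search_analyzer": "keyword"
    }
  }
}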

Elasticsearch custom analyzer with two output tokens

The requirement is to create a custom analyzer which generates two tokens, as shown in the scenario below.
E.g.
Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in
I am able to remove the non-alphanumeric characters, but how do I also retain the original text in the output token list? Below is the custom analyzer that I have created.
"alphanumericStringAnalyzer": {
"filter": [
"lowercase",
"minLength_filter"],
"char_filter": [
"specialCharactersFilter"
],
"type": "custom",
"tokenizer": "keyword"
}
"char_filter": {
"specialCharactersFilter": {
"pattern": "[^A-Za-z0-9]",
"type": "pattern_replace",
"replacement": ""
}
},
This analyzer generates the single token "btechin" for the input "B.tech in", but I also want the original "B.tech in" in the token list.
Thanks!
You can use the word_delimiter token filter as described in the documentation.
Here is an example of a word_delimiter configuration:
POST _analyze
{
  "text": "B.tech in",
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true,
      "generate_word_parts": false
    }
  ]
}
Results:
{
  "tokens": [
    {
      "token": "b.tech in",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "btechin",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
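To wire this into your index settings as a named analyzer, a sketch could look like the following (the filter name my_word_delimiter is illustrative; the analyzer name mirrors your original definition):
"analysis": {
  "filter": {
    "my_word_delimiter": {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true,
      "generate_word_parts": false
    }
  },
  "analyzer": {
    "alphanumericStringAnalyzer": {
      "type": "custom",
      "tokenizer": "keyword",
      "filter": [
        "lowercase",
        "my_word_delimiter"
      ]
    }
  }
}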
I hope it will fulfill your requirements!

Elasticsearch match certain fields exactly but not others

I need Elasticsearch to match certain fields exactly; I am currently using multi_match.
For example, a user types in long beach chiropractor.
I want long beach to match the city field exactly, and not return results for seal beach or glass beach.
At the same time chiropractor should also match chiropractic.
Here is the current query I am using:
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": [
"title",
"location_address_address_1.value",
"location_address_city.value^2",
"location_address_state.value",
"specialty" // e.g. chiropractor
],
"query": "chiropractor long beach",
"boost": 6,
"type": "cross_fields"
}
}
]
}
},
The right approach would be to separate the term that is searched from the location, and store the location as a keyword type. If that's not possible, you can use a synonym filter to store locations as single tokens, but this will require having the list of all possible locations, e.g.:
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "long beach=>long-beach"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
Now if you call
POST /my_index/_analyze?analyzer=my_synonyms
{
  "text": ["chiropractor long beach"]
}
the response is
{
  "tokens": [
    {
      "token": "chiropractor",
      "start_offset": 0,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "long-beach",
      "start_offset": 13,
      "end_offset": 23,
      "type": "SYNONYM",
      "position": 1
    }
  ]
}
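A sketch of how the city field from the question might then be mapped to use this analyzer (this assumes location_address_city is an object with a value sub-field, as the dotted path in the query suggests; adjust if value is actually a multi-field):
"mappings": {
  "properties": {
    "location_address_city": {
      "properties": {
        "value": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}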

Phonetic search results for integers with Elasticsearch

Forgive me as I am new to Elasticsearch, but I am following the Phonetic start guide found here: Phonetic Matching
I have the following
POST /app
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "dbl_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone"
          }
        },
        "analyzer": {
          "dbl_metaphone": {
            "tokenizer": "standard",
            "filter": "dbl_metaphone"
          }
        }
      }
    }
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        },
        "year": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        }
      }
    }
  }
}
I add some results by doing:
POST /app/movie
{ "title": "300", "year": "2006" } & { "title": "500 days of summer", "year": "2009" }
I want to query for the movie '300' by entering this query though:
POST /app/movie/_search
{
  "query": {
    "match": {
      "title.phonetic": {
        "query": "three hundred"
      }
    }
  }
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
  "tokens": [
    {
      "token": "300",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
I see that only a numeric token is returned, not a phonetic encoding like the one I get for:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
  "tokens": [
    {
      "token": "0R",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "TR",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "HNTR",
      "start_offset": 6,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Is there something missing in my phonetic setup that I need to define in order to get both the numeric and phonetic tokens?
That is not possible. Double Metaphone is a phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This makes it possible to search for terms like names that could be spelt differently but sound the same.
As you can see from the algorithm, Double Metaphone ignores numbers/numeric characters.
You can read more about double metaphone here.
A better case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
Shouldn't be too hard to find/write a function that converts integers to English.
Call your function when constructing your document to ingest into ES.
Alternatively, write it in Groovy and call it as a transform script in your mapping.
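For example, a minimal sketch of the alternate-text approach (the title_spoken field name is purely illustrative, and you would generate its value yourself before indexing):
POST /app/movie
{ "title": "300", "title_spoken": "three hundred", "year": "2006" }

POST /app/movie/_search
{
  "query": {
    "multi_match": {
      "query": "three hundred",
      "fields": ["title.phonetic", "title_spoken"]
    }
  }
}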
