Elasticsearch Standard Analyzer + Synonyms - elasticsearch

I am attempting to get synonyms added to our current way of searching for products.
Part of the mappings currently looks like this:
{
  "keywords": {
    "properties": {
      "modifiers": {
        "type": "string",
        "analyzer": "standard"
      },
      "nouns": {
        "type": "string",
        "analyzer": "standard"
      }
    }
  }
}
I am interested in using the synonym filter along with the standard analyzer, and as per the Elasticsearch documentation here, the right way to do that is to add
{
  "analysis": {
    "analyzer": {
      "synonym": {
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "stop", "external_synonym"]
      }
    },
    "filter": {
      "external_synonym": {
        "type": "synonym",
        "synonyms": synonyms
      }
    }
  }
}
into the index settings and use that analyzer in the mapping snippet above. But this does not work, in the sense that the behaviour is quite different (even without adding any synonyms).
I am interested in preserving the relevancy behaviour (as provided by the standard analyzer) and just adding the synonyms list.
Could someone please provide more information on how to exactly replicate the standard analyzer behaviour?
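As a starting point, here is a minimal sketch of settings that should stay close to the standard analyzer's defaults while adding synonyms. The key assumption is that the standard analyzer is essentially the standard tokenizer plus the lowercase filter, with stopwords disabled by default, so the stop filter in the snippet above is the most likely source of the changed behaviour and is dropped here (the synonyms value remains the placeholder from the question):
{
  "analysis": {
    "analyzer": {
      "synonym": {
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "external_synonym"]
      }
    },
    "filter": {
      "external_synonym": {
        "type": "synonym",
        "synonyms": synonyms
      }
    }
  }
}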

Related

Liferay portal 7.3.7 case insensitive, diacritics free with ElasticSearch

I am having a dilemma on Liferay Portal 7.3.7 with case-insensitive and diacritics-free search through Elasticsearch in JournalArticles with custom DDM fields. Liferay generated field mappings in Configuration->Search like this:
...
},
"localized_name_sk_SK_sortable": {
  "store": true,
  "type": "keyword"
},
...
I would like to have these *_sortable fields usable for case-insensitive and diacritics-free searching, so I tried to add an analyzer and a normalizer to the Liferay advanced search configuration in System Settings->Search->Elasticsearch 7 like this:
{
  "analysis": {
    "analyzer": {
      "ascii_analyzer": {
        "tokenizer": "standard",
        "filter": ["asciifolding", "lowercase"]
      }
    },
    "normalizer": {
      "ascii_normalizer": {
        "type": "custom",
        "char_filter": [],
        "filter": ["lowercase", "asciifolding"]
      }
    }
  }
}
After that, I overrode the mapping for template_string_sortable:
{
  "template_string_sortable": {
    "mapping": {
      "analyzer": "ascii_analyzer",
      "normalizer": "ascii_normalizer",
      "store": true,
      "type": "keyword"
    },
    "match_mapping_type": "string",
    "match": "*_sortable"
  }
}
After reindexing, my sortable fields look like this:
...
},
"localized_name_sk_SK_sortable": {
  "normalizer": "ascii_normalizer",
  "store": true,
  "type": "keyword"
},
...
Next, I try to create new content for my DDM structure, but all my sortable fields look the same, like this:
"localized_title_sk_SK": "test diakrity časť 1 ľščťžýáíéôň title",
"localized_title_sk_SK_sortable": "test diakrity časť 1 ľščťžýáíéôň title",
but I need that sortable field without national characters, so that, for example, I can find "cast 1" through a wildcardQuery on localized_title_sk_SK_sortable, and so on... Thanks for any advice (maybe I am just approaching the whole problem wrong? I am really new to ES).
First of all, it would be better to apply the asciifolding filter (e.g. with preserve_original enabled) first and then the lowercase filter, but keep in mind these filters only affect how the field is analyzed for search; your _source data won't be changed because you applied an analyzer to the field.
If you need to manipulate the data before ingesting it, you can use the Ingest pipeline feature in Elasticsearch; for more information check here.
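A minimal sketch of such a pipeline, reusing the field name from the question (the pipeline name is made up, and only the built-in lowercase processor is shown; as far as I know there is no built-in ascii-folding processor, so folding at ingest time would need something like a script processor):
PUT _ingest/pipeline/lowercase_sortable
{
  "description": "Lowercase a sortable field before indexing",
  "processors": [
    {
      "lowercase": {
        "field": "localized_name_sk_SK_sortable"
      }
    }
  ]
}
Documents indexed with ?pipeline=lowercase_sortable then have the field lowercased in _source as well.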

How to preserve original term during transliteration in Elasticsearch with ICU plugin?

I'm using the following ICU transform filter to perform transliteration:
"transliterate": {
  "type": "icu_transform",
  "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
}
The current problem is that this filter replaces the original term in the index, so searching in the native language is not possible with a terms query like this:
{
  "terms": {
    "field": [
      "term"
    ],
    "boost": 1.0
  }
}
Is there any way to make the icu_transform filter produce two terms, the original one and the transliterated one?
If not, I think the optimal solution would be a mapping with copy_to another field and an analyzer without the transliterate filter for that field. Can you suggest something more efficient?
I'm using Elasticsearch 5.6.4.
Multi-fields allow you to index the same source value to different fields in different ways. You can index to a field with the standard analyzer and to another field with an analyzer that applies the ICU transform filter. For example (the settings here wire the transliterate filter from the question into a latin analyzer; the tokenizer and filter order are illustrative):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "transliterate": {
          "type": "icu_transform",
          "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      },
      "analyzer": {
        "latin": {
          "tokenizer": "standard",
          "filter": ["lowercase", "transliterate"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "latin": {
              "type": "text",
              "analyzer": "latin"
            }
          }
        }
      }
    }
  }
}
Then you can query the my_field or my_field.latin field.
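For instance, a query along these lines (the search term is arbitrary) would match against both the original and the transliterated terms:
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Москва",
      "fields": ["my_field", "my_field.latin"]
    }
  }
}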

What is the best way to handle common term which contains special chars, like C#, C++

I have some documents that contain c# or c++ in the title field, which uses the standard analyzer.
When I query for c# on the title field, I get all c# and c++ documents, and the c++ documents even have a higher score than the c# documents. That makes sense, since both '#' and '++' are stripped from the tokens by the standard analyzer.
What is the best way to handle this kind of special term? In my case specifically, I want c# documents to score higher than c++ documents when searching for "C#".
Here is an approach you can use:
Introduce a copy field where you keep the values with their special characters. For that you'll need to:
Introduce a custom analyzer (the whitespace tokenizer is important here: it will preserve your special characters):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
Create the copy field (the _wcc suffix stands for 'with special characters'):
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "prog_lang": {
          "type": "text",
          "copy_to": "prog_lang_wcc",
          "analyzer": "standard"
        },
        "prog_lang_wcc": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
When issuing the query itself, combine a query with a boost on the prog_lang_wcc field like this (it could be either a multi_match or a pure boolean query plus boost):
GET /_search
{
  "query": {
    "multi_match": {
      "query": "c#",
      "type": "phrase",
      "fields": ["prog_lang_wcc^3", "prog_lang"]
    }
  }
}
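For reference, a sketch of the equivalent boolean form (the boost value is arbitrary):
GET /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "prog_lang_wcc": {
              "query": "c#",
              "boost": 3
            }
          }
        },
        {
          "match_phrase": {
            "prog_lang": "c#"
          }
        }
      ]
    }
  }
}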

How to search with keyword analyzer?

I have the keyword analyzer as the default analyzer, like so:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
But now I can't search for anything, e.g.:
{
  "query": {
    "query_string": {
      "query": "cast"
    }
  }
}
This gives me 0 results although "cast" is a common value in the indexed documents (http://gist.github.com/baelter/b0720a52ee5a27e27d3a).
Search for "*" works fine btw.
I only have explicit defaults in my mapping:
{
  "oceanography_point": {
    "_all": {
      "enabled": true
    },
    "properties": {}
  }
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?
Using the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer and no filters.
In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor for "away" will work. You need the exact string "Cast away in forest" to match it (and assuming no lowercase filter is used, you need to give the right case too).
A better approach would be to use multi-fields to declare one copy as keyword-analyzed and the other one as normally analyzed.
You can then search on one of these fields and aggregate on the other, as in the sketch below.
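A minimal sketch of that multi-field idea, using the pre-5.x string type and the type name from the question (the field name title is made up):
PUT my_index
{
  "mappings": {
    "oceanography_point": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
You would then search on title and aggregate or sort on title.raw.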
Okay, after some 15 hours of trial and error I can conclude that this works for search:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Create and put an analyzer to reverse a string (say reverse_analyzer):
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), use a mapping like the following (create a type and append the mapping for phone_no):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no acts as a multi-field, which stores the string and its reverse:
if you index
phone_no: 911220
then in Elasticsearch there will be two fields,
phone_no: 911220 and phone_no.reverse: 022119, so you can search and filter on either the reversed or the non-reversed field.
Hope this helps.
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard) and also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend you create your own plugin. There is the ICU Analysis plugin supported by the Elasticsearch team that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it captures all the numeric characters in groups, then explicitly rebuilds the string followed by its own (numeric-only!) reverse. The space in the middle of the replacement pattern then causes the tokenizer to split the result into two tokens: the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}
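For completeness, a sketch of how these char_filters might be wired into the full analyzer (the char_filter definitions above go in the same analysis block). A whitespace tokenizer is assumed here rather than the keyword tokenizer used earlier, since the keyword tokenizer would not split on the inserted space:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone": {
          "type": "custom",
          "char_filter": ["strip_non_numeric", "dupe_and_reverse"],
          "tokenizer": "whitespace"
        }
      }
    }
  }
}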
