How to preserve the original term during transliteration in Elasticsearch with the ICU plugin?

I'm using the following ICU transform filter to perform transliteration:
"transliterate": {
"type": "icu_transform",
"id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
}
The current problem is that this filter replaces the original term in the index, so searching in the native language is not possible with a terms query like this:
{
  "terms" : {
    "field" : [
      "term"
    ],
    "boost" : 1.0
  }
}
Is there any way to make the icu_transform filter produce two tokens, the original one and the transliterated one?
If not, I think the optimal solution would be a mapping that copies the value to another field, with an analyzer for that field that omits the transliterate filter. Can you suggest something more efficient?
I'm using Elasticsearch 5.6.4

Multi-fields allow you to index the same source value to different fields in different ways. You can index to a field with the standard analyzer and to another field with an analyzer that applies the ICU transform filter. For example (here the settings define the latin analyzer using the transliterate filter from the question):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "transliterate": {
          "type": "icu_transform",
          "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
        }
      },
      "analyzer": {
        "latin": {
          "tokenizer": "standard",
          "filter": [ "transliterate" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "latin": {
              "type": "text",
              "analyzer": "latin"
            }
          }
        }
      }
    }
  }
}
Then you can query the my_field or my_field.latin field.
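For example, a multi_match query across both fields matches whether the user types the native or the transliterated form. A minimal sketch (the sample search term is only illustrative):
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Москва",
      "fields": [ "my_field", "my_field.latin" ]
    }
  }
}
Because my_field.latin applies the same analyzer at search time, the query term is transliterated too, so it matches the transliterated tokens in the index.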

Related

Elasticsearch: Is there a way to exclude synonyms from highlighting?

I'm trying to exclude synonyms from highlighting. I created a copy of my current analyzer with a synonym filter. So for each field I now have an analyzer and a search_analyzer. The search analyzer is the new analyzer with all the same filters plus the synonym filter.
Any ideas? I am using Elasticsearch 5.2.
Mapping:
"mappings": {
"doc": {
"properties": {
"body": {
"type": "text",
"analyzer": "custom_analyzer",
"search_analyzer": "custom_analyzer_with_synonyms",
"fields": {
"plain": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
Search Query:
{
  "query": {
    "match": {
      "body": "something"
    }
  },
  "highlight": {
    "pre_tags": "<strong>",
    "post_tags": "</strong>",
    "fields" : {
      "body.plain" : {
        "number_of_fragments": 1,
        "require_field_match": false
      }
    }
  }
}
I am not sure about the reason behind the problem. I'd have thought that simply highlighting on a non-synonym-analyzed field would have done it, but according to the comments it is still highlighting the synonyms. There are two possible reasons I can think of (I haven't looked into the highlighter source code):
It could be the multi-word synonym problem described in this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html This may have been fixed since that page is old; if not, it could be causing the highlighter to look at the wrong position offsets.
And/or it could be because the highlighted field is not used in the query. The highlighter might simply be using the tokens emitted by the searched field's analyzer (which would contain synonyms) and looking for those tokens in the highlighted field.
If it's the first problem, you could try changing your synonyms to use simple contraction; see the sketch below and https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html#synonyms-contraction But contraction has its own problems with the frequencies of uncommon words and could be a lot of work.
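For illustration, contraction maps every variant onto a single term at index time, so the synonym alternatives never appear as separate tokens. A minimal sketch of the settings fragment (the filter name and rule here are made up):
"filter": {
  "my_synonyms": {
    "type": "synonym",
    "synonyms": [
      "united kingdom,uk => uk"
    ]
  }
}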
The fix for the second case would be to use the body.plain field in the query, but you cannot do that since it would affect your scores. In that case, specifying a separate query for the highlighter (so that scores are not affected) on the non-synonym field does the trick. This works even if the first case is the problem too, since we are not using synonyms in the highlight field.
So your query should look something like this:
{
  "query": {
    "match": {
      "body": "something"
    }
  },
  "highlight": {
    "pre_tags": "<strong>",
    "post_tags": "</strong>",
    "fields" : {
      "body.plain" : {
        "number_of_fragments": 1,
        "highlight_query": {
          "match": { "body.plain": "something" }
        }
      }
    }
  }
}
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-highlighting.html#_highlight_query

Request Body search in Elasticsearch

I am using Elasticsearch 5.4.1. Here is the mapping:
{
  "testi": {
    "mappings": {
      "testt": {
        "properties": {
          "last": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
When I use URI search I receive results. However, when using request body search, the result is empty in every case.
GET testi/testt/_search
{
  "query" : {
    "term" : { "name" : "John" }
  }
}
A couple of things are going on here:
For both last and name, you are indexing the field itself as text and then a subfield as a keyword. Is that your intention? You want to be able to do analyzed/tokenized search on the raw field and then keyword search on the subfield?
If that is your intention, you now have two ways to query each of these fields. For example, name gives you the analyzed version of the field (you declared type text, meaning Elasticsearch applied its standard analyzer to it, which lowercases and tokenizes the input), and name.keyword gives you the unaltered keyword version of this field.
Therefore, your term query expects your input string John to exactly match a token in the field you're querying against. Since you used capitalization in your query input, you likely want to use the keyword subfield of name, so try "term" : { "name.keyword" : "John" } instead.
As a light demonstration of what is happening to the original field, "term" : { "name" : "john" } should work as well.
You were seeing results with URI search because without a query parameter it just executes a match_all. If you do pass a basic text parameter, it executes against _all, which is a concatenation of all the fields in each document, so both the keyword and text versions are available.
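Putting that together, the corrected request body search would be:
GET testi/testt/_search
{
  "query" : {
    "term" : { "name.keyword" : "John" }
  }
}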

What is the best way to handle common terms which contain special chars, like C# or C++?

I have some documents that contain c# or c++ in the title field, which uses the standard analyzer.
When I query c# on the title field, I get all the c# and C++ documents, and the c++ documents even score higher than the c# documents. That makes sense, since both '#' and '++' are stripped from the tokens by the standard analyzer.
What is the best way to handle this kind of special term? In my case specifically, I want c# documents to score higher than c++ documents when searching for "C#".
Here is an approach you can use:
Introduce a copy field that keeps the values with their special characters. For that you'll need to:
Define a custom analyzer (the whitespace tokenizer is important here; it will preserve your special characters):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
Create the copy field by adding the mappings; since the first request already created the index, this uses the _mapping endpoint (the _wcc suffix stands for 'with special characters'):
PUT my_index/_mapping/my_type
{
  "properties": {
    "prog_lang": {
      "type": "text",
      "copy_to": "prog_lang_wcc",
      "analyzer": "standard"
    },
    "prog_lang_wcc": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
When issuing the query itself, you combine a query with a boost on the prog_lang_wcc field like this (it could be either a multi_match or a pure bool query plus boost):
GET /_search
{
  "query": {
    "multi_match" : {
      "query" : "c#",
      "type": "phrase",
      "fields" : [ "prog_lang_wcc^3", "prog_lang" ]
    }
  }
}

Wildcard query over _all field on Elasticsearch

I'm trying to perform wildcard queries over the _all field. An example query could be:
GET index/type/_search
{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "bool" : {
      "must" : {
        "wildcard" : {
          "_all" : "*tito*"
        }
      }
    }
  }
}
The thing is that to use a wildcard query, the _all field needs to be not_analyzed, otherwise the query won't work. See the ES documentation for more info.
I tried to set the mappings over the _all field using this request:
PUT index
{
  "mappings": {
    "type": {
      "_all" : {
        "enabled" : true,
        "index_analyzer": "not_analyzed",
        "search_analyzer": "not_analyzed"
      },
      "_timestamp": {
        "enabled": "true"
      },
      "properties": {
        "someProp": {
          "type": "date"
        }
      }
    }
  }
}
But I'm getting the error analyzer [not_analyzed] not found for field [_all].
I want to know what I'm doing wrong and whether there is another (better) way to perform this kind of query.
Thanks.
Have you tried removing:
"search_analyzer": "not_analyzed"
Also, I wonder how well a wildcard across all properties will scale. Have you looked into NGrams? See the docs here.
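To sketch the ngram idea (the filter and analyzer names here are hypothetical): index with an ngram analyzer, and then a plain match query gives you substring-style matching without the cost of a wildcard scan:
PUT index
{
  "settings": {
    "analysis": {
      "filter": {
        "substring_grams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "substring_grams" ]
        }
      }
    }
  }
}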
Most probably you wanted to set the option
"index": "not_analyzed"
The index attribute of a string field (and _all is a string field) determines whether that field is analyzed or not.
search_analyzer determines which analyzer is used for the user-entered query; it only applies if the index attribute is set to analyzed.
index_analyzer determines which analyzer is used for documents at index time; again, it only applies if the index attribute is set to analyzed.

How to search with keyword analyzer?

I have the keyword analyzer as the default analyzer, like so:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
But now I can't find anything when searching, e.g.:
{
  "query": {
    "query_string": {
      "query": "cast"
    }
  }
}
This gives me 0 results although "cast" is a common value in the indexed documents. (http://gist.github.com/baelter/b0720a52ee5a27e27d3a)
Search for "*" works fine btw.
I only have explicit defaults in my mapping:
{
  "oceanography_point": {
    "_all" : {
      "enabled" : true
    },
    "properties" : {}
  }
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?
Using the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer and no filters.
In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor "away" will work. You need to search for the exact string "Cast away in forest" to match it. (Assuming no lowercase filter is used, you also need to get the case right.)
A better approach would be to use multi-fields to declare one copy as keyword-analyzed and the other one normal.
You can search on one of these fields and aggregate on the other.
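A minimal sketch of that multi-field idea, using the string syntax of this era of Elasticsearch (the field names here are made up):
{
  "mappings": {
    "oceanography_point": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "standard",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
You would then search on title and aggregate on title.raw.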
OK, after some 15 hours of trial and error, I can conclude that this works for search:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],
