Completion Suggester Foreign Language Accents Greek

I am trying to use the completion suggester with the Greek language. Unfortunately I have problems with accented characters like ά. I've tried a few approaches: one was simply to set the greek analyzer in the mapping, the other a lowercase analyzer with asciifolding. No success; with the greek analyzer I don't even get a result when the prefix contains the accent.
Below is what I did; it would be great if anyone could help me out here.
Mapping
PUT t1
{
  "mappings": {
    "profession" : {
      "properties" : {
        "text" : {
          "type" : "keyword"
        },
        "suggest" : {
          "type" : "completion",
          "analyzer": "greek"
        }
      }
    }
  }
}
Dummy
POST t1/profession/?refresh
{
  "suggest" : {
    "input": [ "Μάγειρας" ]
  },
  "text": "Μάγειρας"
}
Query
GET t1/profession/_search
{ "suggest":
{ "profession" :
{ "prefix" : "Μα"
, "completion" :
{ "field" : "suggest"}
}}}

I found a way to do it with a custom analyzer, or via a plugin for ES which I highly recommend when it comes to non-Latin text.
Option 1
PUT t1
{ "settings":
{ "analysis":
{ "filter":
{ "greek_lowercase":
{ "type": "lowercase"
, "language": "greek"
}
}
, "analyzer":
{ "autocomplete":
{ "tokenizer": "lowercase"
, "filter":
[ "greek_lowercase" ]
}
}
}}
, "mappings": {
"profession" : {
"properties" : {
"text" : {
"type" : "keyword"
},
"suggest" : {
"type" : "completion",
"analyzer": "autocomplete"
}
}}}
}
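To verify what the suggester will index, you can run the analyzer directly; the greek_lowercase filter both lowercases and strips Greek accents, so I would expect a single token μαγειρας here:
POST t1/_analyze
{
  "analyzer": "autocomplete",
  "text": "Μάγειρας"
}
Since the completion suggester analyzes the prefix with the same analyzer, both Μα and μά should now match.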
Option 2: ICU Plugin
Install the ES plugin:
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
{ "settings": {
"index": {
"analysis": {
"normalizer": {
"latin": {
"filter": [
"custom_latin_transform"
]
}
},
"analyzer": {
"latin": {
"tokenizer": "keyword",
"filter": [
"custom_latin_transform"
]
}
},
"filter": {
"noDelimiter": {"type": "word_delimiter"},
"custom_latin_transform": {
"type": "icu_transform",
"id": "Greek-Latin/UNGEGN; Lower(); NFD; [:Nonspacing Mark:] Remove; NFC"
}
}
}
}
}
, "mappings":
{ "doc" : {
"properties" : {
"verbose" : {
"type" : "keyword"
},
"name" : {
"type" : "keyword"
},
"slugHash":{
"type" : "keyword",
"normalizer": "latin"
},
"level": { "type": "keyword" },
"hirarchy": {
"type" : "keyword"
},
"geopoint": { "type": "geo_point" },
"suggest" :
{ "type" : "completion"
, "analyzer": "latin"
, "contexts":
[ { "name": "level"
, "type": "category"
, "path": "level"
}
]
}}
}
}}
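Assuming the settings above were applied to an index named t2 (the index name is hypothetical, since the PUT line isn't shown), the ICU transform can be checked the same way; it should transliterate the Greek input to something like mageiras:
POST t2/_analyze
{
  "analyzer": "latin",
  "text": "Μάγειρας"
}
A nice side effect is that users can type either a Greek prefix (μαγ...) or a Latin one (mag...) and land on the same suggestion, since both the indexed input and the search prefix go through the same transform.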

Related

Elasticsearch sort script not working as expected for few documents only

Consider a query such as this one:
{
"size": 200,
"query": {
"bool" : {
....
}
},
"sort": {
"_script" : {
"script" : {
"source" : "params._source.participants[0].participantEmail",
"lang" : "painless"
},
"type" : "string",
"order" : "desc"
}
}
}
This query works for almost every document, but a few of them are not in their correct place. How can that be?
The order of the last documents looks like this (I'm displaying the first item of the participants array of each doc):
shiend#....
denys#...
Lynn#...
How is this possible? I have no lead here. Is the sort query wrong?
Settings:
"myindex" : {
"settings" : {
"index" : {
"refresh_interval" : "30s",
"number_of_shards" : "5",
"provided_name" : "myindex",
"creation_date" : "1600703588497",
"analysis" : {
"filter" : {
"english_keywords" : {
"keywords" : [
"example"
],
"type" : "keyword_marker"
},
"english_stemmer" : {
"type" : "stemmer",
"language" : "english"
},
"synonym" : {
"type" : "synonym",
"synonyms_path" : "analysis/UK_US_Sync_2.csv",
"updateable" : "true"
},
"english_possessive_stemmer" : {
"type" : "stemmer",
"language" : "possessive_english"
},
"english_stop" : {
"type" : "stop",
"stopwords" : "_english_"
},
"my_katakana_stemmer" : {
"type" : "kuromoji_stemmer",
"minimum_length" : "4"
}
},
"normalizer" : {
"custom_normalizer" : {
"filter" : [
"lowercase",
"asciifolding"
],
"type" : "custom",
"char_filter" : [ ]
}
},
"analyzer" : {
"somevar_english" : {
"filter" : [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer",
"asciifolding",
"synonym"
],
"tokenizer" : "standard"
},
"myvar_chinese" : {
"filter" : [
"porter_stem"
],
"tokenizer" : "smartcn_tokenizer"
},
"myvar" : {
"filter" : [
"my_katakana_stemmer"
],
"tokenizer" : "kuromoji_tokenizer"
}
}
},
"number_of_replicas" : "1",
"uuid" : "d0LlBVqIQGSk4afEWFD",
"version" : {
"created" : "6081099",
"upgraded" : "6081299"
}
}
}
}
Mapping:
{
"myindex": {
"mappings": {
"doc": {
"dynamic_date_formats": [
"yyyy-MM-dd HH:mm:ss.SSS"
],
"properties": {
"all_fields": {
"type": "text"
},
"participants": {
"type": "nested",
"include_in_root": true,
"properties": {
"participantEmail": {
"type": "keyword",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "custom_normalizer"
}
},
"copy_to": [
"all_fields"
]
},
"participantType": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "custom_normalizer"
}
},
"copy_to": [
"all_fields"
]
}
}
}
}
}
}
}
}
EDIT: Maybe it's because the email Lynn#.. starts with an uppercase?
Indeed, strings are sorted in lexicographic order, i.e. uppercase letters come before lowercase ones (and the other way around for descending order, which is why Lynn#... ends up last).
What you can do is to lowercase all emails in your script:
"sort": {
"_script" : {
"script" : {
"source" : "params._source.participants[0].participantEmail.toLowerCase()",
"lang" : "painless"
},
"type" : "string",
"order" : "desc"
}
}
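As a side note (my suggestion, not part of the original answer): the mapping above already defines a lowercased keyword subfield, participants.participantEmail.keyword with custom_normalizer, so a plain field sort avoids loading _source in a script entirely. A sketch:
"sort": [
  {
    "participants.participantEmail.keyword": {
      "order": "desc",
      "nested": { "path": "participants" }
    }
  }
]
Note this is not strictly equivalent: a nested sort picks the max (for desc) email across all participants of each document, not participants[0], so it only matches the script's behavior when the first participant is the relevant one.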

Elasticsearch - Return filtered fields based on values

I am developing a query in which I would like to identify, within the nested "ces" field, those entries whose "ces.desc" or "ces.output" contains the value "jack".
{"_source": ["ces.desc","ces.output"] ,
"query": {
"nested": {
"path": "ces",
"query": {
"bool": {
"should": [
{"term": {"ces.desc": "jack"}},
{"term": {"ces.output": "jack"}}
]
}
}
}
},
"aggs": {
"nestedData": {
"nested": {
"path": "ces"
},
"aggs": {
"data_desc": {
"filter": {
"term": {
"ces.desc": "jack"
}
}
}
}
}
}
}
And the output is:
{
  "ces" : [
    {
      "output" : "Laura",    <-------------- WRONG
      "desc" : "fernando"    <-------------- WRONG
    },
    {
      "output" : "",
      "desc" : "jack"        <-------------- RIGHT
    },
    {
      "output" : "jack",     <-------------- RIGHT
      "desc" : "Fer"
    }
  ]
}
mapping:
{
"names_1" : {
"aliases" : { },
"mappings" : {
"properties" : {
"created_at" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"data" : {
"properties" : {
"addresses" : {
"properties" : {
"asn" : {
"type" : "long"
},
"ces" : {
"type" : "nested",
"properties" : {
"banner" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"desc" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"output" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"source" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"tag" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"error" : {
"type" : "long"
},
"finished_at" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"total" : {
"type" : "long"
}
}
}
}
}
I would like to filter down to only those entries that match the "bool" condition, i.e.:
if (ces.desc == "jack") or (ces.output == "jack")
return ces.desc, ces.output (key and value)
And with the "aggs" I added, the "jack" count should come out as doc_count = 2.
Which part of the query am I getting wrong?
The mapping I tried for the query:
{
"mappings": {
"properties": {
"data.addresses":{
"type":"nested",
"properties": {
"data.addresses.ces": {
"type": "nested"
}
}
}
}
}
}
You need to change your mapping to the following, i.e. BOTH addresses AND ces need to be nested:
{
"aliases": {},
"mappings": {
"properties": {
"created_at": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"data": {
"properties": {
"addresses": {
"type": "nested", <------ MUST BE NESTED
"properties": {
"asn": {
"type": "long"
},
"ces": {
"type": "nested", <------ MUST BE NESTED
"properties": {
"banner": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"desc": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"output": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"source": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"tag": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"error": {
"type": "long"
},
"finished_at": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"total": {
"type": "long"
}
}
}
}
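Note that an existing field cannot be switched from object to nested in place; you would create a new index with the mapping above and reindex into it, along these lines (names_2 is a placeholder for the new index):
POST _reindex
{
  "source": { "index": "names_1" },
  "dest": { "index": "names_2" }
}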
Then you simply need to use nested inner_hits:
{
"_source": false,
"query": {
"nested": {
"path": "data.addresses.ces",
"inner_hits": {}, <---- ADD THIS
"query": {
"bool": {
"should": [
{
"term": {
"data.addresses.ces.desc": "jack"
}
},
{
"term": {
"data.addresses.ces.output": "jack"
}
}
]
}
}
}
},
"aggs": {
"nestedData": {
"nested": {
"path": "data.addresses.ces"
},
"aggs": {
"data_desc": {
"filter": {
"term": {
"data.addresses.ces.desc": "jack"
}
}
}
}
}
}
}
And the response will only contain the nested inner hits containing "jack".
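For each matching document, the hit then carries an inner_hits section shaped roughly like this (abbreviated sketch based on the sample data above; the real response also includes _nested offsets and scores):
"inner_hits" : {
  "data.addresses.ces" : {
    "hits" : {
      "hits" : [
        { "_source" : { "desc" : "jack", "output" : "" } },
        { "_source" : { "desc" : "Fer", "output" : "jack" } }
      ]
    }
  }
}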

elasticsearch mapper_parsing_exception Root mapping definition has unsupported parameters

I'm having the following issue with Elasticsearch 7 when trying to create a template.
I'm copying a template from Elasticsearch 6 to 7, and I have already removed some of the fields as required by Elasticsearch 7, i.e.:
{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "Root mapping definition has unsupported parameters: [events : {properties={msg={fields={raw={type=keyword}}}, requestId={type=keyword}, logger={type=keyword}, host={type=keyword}, jwtOwner={type=keyword}, requestOriginator={type=keyword}, tag={analyzer=firsttoken, fields={disambiguator={analyzer=keyword, type=text}}}, jwtAuthenticatedUser={type=keyword}, thread={type=keyword}, requestChainOriginator={type=keyword}, revision={type=keyword}}}]"
}
],
"type": "mapper_parsing_exception",
"reason": "Failed to parse mapping [_doc]: Root mapping definition has unsupported parameters: [events : {properties={msg={fields={raw={type=keyword}}}, requestId={type=keyword}, logger={type=keyword}, host={type=keyword}, jwtOwner={type=keyword}, requestOriginator={type=keyword}, tag={analyzer=firsttoken, fields={disambiguator={analyzer=keyword, type=text}}}, jwtAuthenticatedUser={type=keyword}, thread={type=keyword}, requestChainOriginator={type=keyword}, revision={type=keyword}}}]",
"caused_by": {
"type": "mapper_parsing_exception",
"reason": "Root mapping definition has unsupported parameters: [events : {properties={msg={fields={raw={type=keyword}}}, requestId={type=keyword}, logger={type=keyword}, host={type=keyword}, jwtOwner={type=keyword}, requestOriginator={type=keyword}, tag={analyzer=firsttoken, fields={disambiguator={analyzer=keyword, type=text}}}, jwtAuthenticatedUser={type=keyword}, thread={type=keyword}, requestChainOriginator={type=keyword}, revision={type=keyword}}}]"
}
},
"status": 400
}
Mapping template: The following is the template I'm trying to post.
POST _template/logstash
{
"order" : 0,
"index_patterns" : [
"logstash*"
],
"settings" : {
"index" : {
"analysis" : {
"filter" : {
"firsttoken" : {
"type" : "pattern_capture",
"preserve_original" : "false",
"patterns" : [
"""^([^\.]*)\.?.*$"""
]
},
"secondtoken" : {
"type" : "pattern_capture",
"preserve_original" : "false",
"patterns" : [
"""^[^\.]*\.([^\.]*)\.?.*$"""
]
},
"thirdtoken" : {
"type" : "pattern_capture",
"preserve_original" : "false",
"patterns" : [
"""^[^\.]*\.[^\.]*\.([^\.]*)\.?.*$"""
]
}
},
"analyzer" : {
"firsttoken" : {
"filter" : [
"firsttoken"
],
"tokenizer" : "keyword"
},
"secondtoken" : {
"filter" : [
"secondtoken"
],
"tokenizer" : "keyword"
},
"thirdtoken" : {
"filter" : [
"thirdtoken"
],
"tokenizer" : "keyword"
}
}
},
"mapper" : {
}
}
},
"mappings" : {
"events" : {
"properties" : {
"msg" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
}
}
},
"requestId" : {
"type" : "keyword"
},
"logger" : {
"type" : "keyword"
},
"host" : {
"type" : "keyword"
},
"jwtOwner" : {
"type" : "keyword"
},
"requestOriginator" : {
"type" : "keyword"
},
"tag" : {
"analyzer" : "firsttoken",
"fields" : {
"disambiguator" : {
"analyzer" : "keyword",
"type" : "text"
}
}
},
"jwtAuthenticatedUser" : {
"type" : "keyword"
},
"thread" : {
"type" : "keyword"
},
"requestChainOriginator" : {
"type" : "keyword"
},
"revision" : {
"type" : "keyword"
}
}
}
},
"aliases" : { }
}
Please help me resolve the issue. Thanks in advance.
There are two issues. One is the one mentioned by @OpsterESNinjaKamal.
But it still won't work, as the tag field has no type.
Here is the template that will work:
PUT _template/logstash
{
"order": 0,
"index_patterns": [
"logstash*"
],
"settings": {
"index": {
"analysis": {
"filter": {
"firsttoken": {
"type": "pattern_capture",
"preserve_original": "false",
"patterns": [
"^([^\\.]*)\\.?.*$"
]
},
"secondtoken": {
"type": "pattern_capture",
"preserve_original": "false",
"patterns": [
"^[^\\.]*\\.([^\\.]*)\\.?.*$"
]
},
"thirdtoken": {
"type": "pattern_capture",
"preserve_original": "false",
"patterns": [
"^[^\\.]*\\.[^\\.]*\\.([^\\.]*)\\.?.*$"
]
}
},
"analyzer": {
"firsttoken": {
"filter": [
"firsttoken"
],
"tokenizer": "keyword"
},
"secondtoken": {
"filter": [
"secondtoken"
],
"tokenizer": "keyword"
},
"thirdtoken": {
"filter": [
"thirdtoken"
],
"tokenizer": "keyword"
}
}
},
"mapper": {}
}
},
"mappings": {
"properties": {
"msg": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"requestId": {
"type": "keyword"
},
"logger": {
"type": "keyword"
},
"host": {
"type": "keyword"
},
"jwtOwner": {
"type": "keyword"
},
"requestOriginator": {
"type": "keyword"
},
"tag": {
"type": "text", <--- add type here
"analyzer": "firsttoken",
"fields": {
"disambiguator": {
"analyzer": "keyword",
"type": "text"
}
}
},
"jwtAuthenticatedUser": {
"type": "keyword"
},
"thread": {
"type": "keyword"
},
"requestChainOriginator": {
"type": "keyword"
},
"revision": {
"type": "keyword"
}
}
},
"aliases": {}
}
Notice your mappings: since version 7.0, ES no longer supports mapping types (events in this case); they have been deprecated.
From version 7.0 onwards, you need a separate index for every type you had in an index prior to 7.0.
This link should help you migrate from version 6.x to 7.x.
Basically, your mappings section would be as follows:
{
"mappings":{
"properties":{ <---- Notice there is no `events` before `properties` as mentioned in your question
"msg":{
"type":"text",
"fields":{
"raw":{
"type":"keyword"
}
}
},
"requestId":{
"type":"keyword"
},
"logger":{
"type":"keyword"
},
"host":{
"type":"keyword"
},
"jwtOwner":{
"type":"keyword"
},
"requestOriginator":{
"type":"keyword"
},
"tag":{
"analyzer":"firsttoken",
"fields":{
"disambiguator":{
"analyzer":"keyword",
"type":"text"
}
}
},
"jwtAuthenticatedUser":{
"type":"keyword"
},
"thread":{
"type":"keyword"
},
"requestChainOriginator":{
"type":"keyword"
},
"revision":{
"type":"keyword"
}
}
}
}
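As a stop-gap during the migration (my addition, not from the answers above): 7.x still accepts typed template mappings if you pass the deprecated include_type_name parameter, which is removed in 8.0, so dropping the events level remains the real fix. A minimal sketch:
PUT _template/logstash?include_type_name=true
{
  "index_patterns": [ "logstash*" ],
  "mappings": {
    "events": {
      "properties": {
        "msg": { "type": "text" }
      }
    }
  }
}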
Sorry, Vol and Opster, I missed adding the events template. I had deleted the events type because it was not being accepted. The following is the template for events.
PUT _template/logstash
{
"order" : 0,
"index_patterns" : [
"logstash*"
],
"settings" : {
"index" : {
"analysis" : {
"filter" : {
"firsttoken" : {
"type" : "pattern_capture",
"preserve_original" : "false",
"patterns" : [
"""^([^\.]*)\.?.*$"""
]
},
"secondtoken" : {
"type" : "pattern_capture",
"preserve_original" : "false",
"patterns" : [
"""^[^\.]*\.([^\.]*)\.?.*$"""
]
},
"thirdtoken" : {
"type" : "pattern_capture",
"preserve_original" : "false",
"patterns" : [
"""^[^\.]*\.[^\.]*\.([^\.]*)\.?.*$"""
]
}
},
"analyzer" : {
"firsttoken" : {
"filter" : [
"firsttoken"
],
"tokenizer" : "keyword"
},
"secondtoken" : {
"filter" : [
"secondtoken"
],
"tokenizer" : "keyword"
},
"thirdtoken" : {
"filter" : [
"thirdtoken"
],
"tokenizer" : "keyword"
}
}
},
"mapper" : {
}
}
},
"mappings" : {
"events" : {
"properties" : {
"msg" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
}
}
},
"requestId" : {
"type" : "keyword"
},
"logger" : {
"type" : "keyword"
},
"host" : {
"type" : "keyword"
},
"jwtOwner" : {
"type" : "keyword"
},
"requestOriginator" : {
"type" : "keyword"
},
"tag" : {
"analyzer" : "firsttoken",
"fields" : {
"disambiguator" : {
"analyzer" : "keyword",
"type" : "text"
}
},
"type" : "text"
},
"jwtAuthenticatedUser" : {
"type" : "keyword"
},
"thread" : {
"type" : "keyword"
},
"requestChainOriginator" : {
"type" : "keyword"
},
"revision" : {
"type" : "keyword"
}
}
}
},
"aliases" : { }
}

Elastic Search Highlight Not Working With Custom Analyzer/Tokenizer

I can't figure out why highlighting is not working. The query works, but highlight just shows the field content without <em> tags. Here are my settings and mappings:
PUT wmsearch
{
"settings": {
"index.mapping.total_fields.limit": 2000,
"analysis": {
"analyzer": {
"custom": {
"type": "custom",
"tokenizer": "custom_token",
"filter": [
"lowercase"
]
},
"custom2": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"custom_token": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10
}
}
}
},
"mappings": {
"doc": {
"properties": {
"document": {
"properties": {
"reference": {
"type": "text",
"analyzer": "custom"
}
}
},
"scope" : {
"type" : "nested",
"properties" : {
"level" : {
"type" : "integer"
},
"ancestors" : {
"type" : "keyword",
"index" : "true"
},
"value" : {
"type" : "keyword",
"index" : "true"
},
"order" : {
"type" : "integer"
}
}
}
}
}
}
}
Here is my query:
GET wmsearch/_search
{
"query": {
"simple_query_string" : {
"fields": ["document.reference"],
"analyzer": "custom2",
"query" : "bloom"
}
},
"highlight" : {
"fields" : {
"document.reference" : {}
}
}
}
The query does return the correct results, and the highlight field exists within the results. However, there are no <em> tags around "bloom"; it just shows the entire string with no tags at all.
Does anyone see any issues here, or can anyone help?
Thanks
I got it to work by adding "index_options": "offsets" to my mappings for document.reference.
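For reference, that means the field mapping would look like this (my reconstruction of the fix just described):
"reference": {
  "type": "text",
  "analyzer": "custom",
  "index_options": "offsets"
}
With offsets stored in the postings list, the highlighter can locate match positions directly instead of re-analyzing the ngram-tokenized text, which is presumably why the <em> tags appear after this change.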

Elasticsearch fuzziness breaks plural

I'm trying to use Elasticsearch with these settings:
"settings" : {
"number_of_shards" : 1,
"number_of_replicas" : 0,
"analysis" : {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter": [
"standard",
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer"
]
}
}
}
},
"mappings": {
"_default_": {
"properties" : {
"description" : {
"type" : "text",
"analyzer" : "default",
"search_analyzer": "default"
}
}
}
}
And this search query:
"query": {
"query_string" : {
"query" : "signed~ golf~ hats~",
"fuzziness" : 'AUTO'
}
}
I am trying two searches: signed~ golf~ hat~ and signed~ golf~ hats~. Because of the analyzer, I would expect both to return the same results for plural and singular hats, but they don't. I think the reason is the fuzziness operator ~: when I remove it, the results are the same, but then misspellings aren't caught. Is there a way to get fuzzy search so that misspellings are caught but my plural/singular searches return the same results?
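To see why that expectation is reasonable, the default analyzer above can be checked directly (my_index stands in for your actual index name):
POST my_index/_analyze
{
  "analyzer": "default",
  "text": "signed golf hats"
}
This should return the stemmed tokens sign, golf and hat, so the non-fuzzy terms match identically. Fuzzy terms, however, are not run through the stemmer (stemming filters are not applied when analyzing multi-term queries such as fuzzy ones, as far as I know), so hats~ is matched by edit distance against the stemmed index terms, which explains the differing results.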
