I'm trying to use Elasticsearch with these settings:
"settings" : {
"number_of_shards" : 1,
"number_of_replicas" : 0,
"analysis" : {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter": [
"standard",
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer"
]
}
}
}
},
"mappings": {
"_default_": {
"properties" : {
"description" : {
"type" : "text",
"analyzer" : "default",
"search_analyzer": "default"
}
}
}
}
And this search query:
"query": {
"query_string" : {
"query" : "signed~ golf~ hats~",
"fuzziness" : 'AUTO'
}
}
I am trying to run two searches: signed~ golf~ hat~ and signed~ golf~ hats~. Because of the analyzer, I would expect both to return the same results for the plural and singular "hats", but they don't. I think the reason is the fuzziness operator ~: when I remove it, the search results are the same, but then misspellings don't match. Is there a way to get fuzzy search so that misspellings are caught, but my plural/singular searches return the same results?
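One likely explanation is that in query_string, a term carrying the ~ suffix is fuzzy-matched more or less as typed, without going through the stemming chain, so hat~ and hats~ expand to different term sets. A hedged sketch of a workaround (index name myindex is an assumption): a match query with the fuzziness parameter, which analyzes the query text first (so "hats" is stemmed to "hat") and only then applies fuzzy expansion.

```json
GET myindex/_search
{
  "query": {
    "match": {
      "description": {
        "query": "signed golf hats",
        "operator": "and",
        "fuzziness": "AUTO"
      }
    }
  }
}
```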
Consider a query such as this one:
{
"size": 200,
"query": {
"bool" : {
....
}
},
"sort": {
"_script" : {
"script" : {
"source" : "params._source.participants[0].participantEmail",
"lang" : "painless"
},
"type" : "string",
"order" : "desc"
}
}
}
This query works for almost every document, but some of them are not in their correct place. How can that be?
The order of the last documents looks like this (I'm displaying the first item of the participants array of each doc):
shiend#....
denys#...
Lynn#...
How is that possible? I have no clue. Is the sort query wrong?
Settings:
"myindex" : {
"settings" : {
"index" : {
"refresh_interval" : "30s",
"number_of_shards" : "5",
"provided_name" : "myindex",
"creation_date" : "1600703588497",
"analysis" : {
"filter" : {
"english_keywords" : {
"keywords" : [
"example"
],
"type" : "keyword_marker"
},
"english_stemmer" : {
"type" : "stemmer",
"language" : "english"
},
"synonym" : {
"type" : "synonym",
"synonyms_path" : "analysis/UK_US_Sync_2.csv",
"updateable" : "true"
},
"english_possessive_stemmer" : {
"type" : "stemmer",
"language" : "possessive_english"
},
"english_stop" : {
"type" : "stop",
"stopwords" : "_english_"
},
"my_katakana_stemmer" : {
"type" : "kuromoji_stemmer",
"minimum_length" : "4"
}
},
"normalizer" : {
"custom_normalizer" : {
"filter" : [
"lowercase",
"asciifolding"
],
"type" : "custom",
"char_filter" : [ ]
}
},
"analyzer" : {
"somevar_english" : {
"filter" : [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer",
"asciifolding",
"synonym"
],
"tokenizer" : "standard"
},
"myvar_chinese" : {
"filter" : [
"porter_stem"
],
"tokenizer" : "smartcn_tokenizer"
},
"myvar" : {
"filter" : [
"my_katakana_stemmer"
],
"tokenizer" : "kuromoji_tokenizer"
}
}
},
"number_of_replicas" : "1",
"uuid" : "d0LlBVqIQGSk4afEWFD",
"version" : {
"created" : "6081099",
"upgraded" : "6081299"
}
}
}
}
Mapping:
{
"myindex": {
"mappings": {
"doc": {
"dynamic_date_formats": [
"yyyy-MM-dd HH:mm:ss.SSS"
],
"properties": {
"all_fields": {
"type": "text"
},
"participants": {
"type": "nested",
"include_in_root": true,
"properties": {
"participantEmail": {
"type": "keyword",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "custom_normalizer"
}
},
"copy_to": [
"all_fields"
]
},
"participantType": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
"normalizer": "custom_normalizer"
}
},
"copy_to": [
"all_fields"
]
}
}
}
}
}
}
}
}
EDIT: Maybe it's because the email Lynn#.. starts with an uppercase letter?
Indeed, strings are sorted in lexicographical order, i.e. uppercase letters come before lowercase ones (the other way around for descending order).
What you can do is lowercase all emails in your script:
"sort": {
"_script" : {
"script" : {
"source" : "params._source.participants[0].participantEmail.toLowerCase()",
"lang" : "painless"
},
"type" : "string",
"order" : "desc"
}
}
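As an aside (a sketch, not part of the original answer): because the participantEmail.keyword subfield uses custom_normalizer (lowercase + asciifolding), sorting on that keyword field with a nested sort would give case-insensitive ordering without loading _source in a script. Note this sorts each document by its minimum or maximum participant email, not specifically by participants[0]:

```json
"sort": [
  {
    "participants.participantEmail.keyword": {
      "order": "desc",
      "nested": { "path": "participants" }
    }
  }
]
```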
I want to use English and German custom analyzers together with other analyzers, for example ngram. Is the following mapping correct? I am getting an error for the German analyzer: [unknown setting [index.filter.german_stop.type]]. I searched but did not find any information about using multiple language analyzers in a custom type. Is it possible to use a language-specific ngram filter?
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"english_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"english_stop",
"ngram_filter_en"
],
"tokenizer": "whitespace"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"ngram_filter_en": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
},
"german_analyzer" : {
"type" : "custom",
"filter" : [
"lowercase",
"german_stop",
"ngram_filter_de"
],
"tokenizer" : "whitespace"
}
},
"filter" : {
"german_stop" : {
"type" : "stop"
},
"ngram_filter_de" : {
"type" : "edge_ngram",
"min_ngram" : "1",
"max_gram" : 25
}
}
},
"mappings" : {
"dynamic" : true,
"properties": {
"content" : {
"tye" : "text",
"properties" : {
"en" : {
"type" : "text",
"analyzer" : "english_analyzer"
},
"de" : {
"type" : "text",
"analyzer" : "german_analyzer"
}
}
}
}
There are small syntax errors.
You have your last filter object outside the analysis context.
You cannot have the same key multiple times in a JSON object.
So, the settings below should help:
{
"analysis": {
"analyzer": {
"english_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"english_stop",
"ngram_filter_en"
],
"tokenizer": "whitespace"
},
"german_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"german_stop",
"ngram_filter_de"
],
"tokenizer": "whitespace"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"ngram_filter_en": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
},
"german_stop": {
"type": "stop"
},
"ngram_filter_de": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
}
}
}
To understand the error in your mapping:
{
"analysis": {
"analyzer": {
"english_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"english_stop",
"ngram_filter_en"
],
"tokenizer": "whitespace"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"ngram_filter_en": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
},
"german_analyzer" : {//**This sits outside the "analyzer" object, so it is not registered as an analyzer**
"type" : "custom",
"filter" : [
"lowercase",
"german_stop",
"ngram_filter_de"
],
"tokenizer" : "whitespace"
}
},
"filter" : {//**This is outside "analysis": you cannot simply add another "filter" key, so merge both filter objects as above**
"german_stop" : {
"type" : "stop"
},
"ngram_filter_de" : {
"type" : "edge_ngram",
"min_ngram" : "1",
"max_gram" : 25
}
}
}
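The mappings section of the question has separate issues that the error message does not reach yet: "tye" is a typo for "type", and a text field cannot take sub-properties. Assuming the intent is to index content with both analyzers, a multi-fields sketch could look like this:

```json
"mappings": {
  "dynamic": true,
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "en": {
          "type": "text",
          "analyzer": "english_analyzer"
        },
        "de": {
          "type": "text",
          "analyzer": "german_analyzer"
        }
      }
    }
  }
}
```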
I can't figure out why highlight is not working. The query works, but the highlight just shows the field content without em tags. Here are my settings and mappings:
PUT wmsearch
{
"settings": {
"index.mapping.total_fields.limit": 2000,
"analysis": {
"analyzer": {
"custom": {
"type": "custom",
"tokenizer": "custom_token",
"filter": [
"lowercase"
]
},
"custom2": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"custom_token": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10
}
}
}
},
"mappings": {
"doc": {
"properties": {
"document": {
"properties": {
"reference": {
"type": "text",
"analyzer": "custom"
}
}
},
"scope" : {
"type" : "nested",
"properties" : {
"level" : {
"type" : "integer"
},
"ancestors" : {
"type" : "keyword",
"index" : "true"
},
"value" : {
"type" : "keyword",
"index" : "true"
},
"order" : {
"type" : "integer"
}
}
}
}
}
}
}
Here is my query:
GET wmsearch/_search
{
"query": {
"simple_query_string" : {
"fields": ["document.reference"],
"analyzer": "custom2",
"query" : "bloom"
}
},
"highlight" : {
"fields" : {
"document.reference" : {}
}
}
}
The query does return the correct results, and the highlight field exists within the results. However, there are no em tags around "bloom"; it just shows the entire string with no tags at all.
Does anyone see any issues here, or can help?
Thanks
I got it to work by adding "index_options": "offsets" to my mapping for document.reference.
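For reference, the adjusted property would look something like this (a sketch; storing offsets lets the highlighter locate matches inside ngram-analyzed text, where the plain highlighter often fails):

```json
"reference": {
  "type": "text",
  "analyzer": "custom",
  "index_options": "offsets"
}
```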
I am trying to use the Completion Suggester with the Greek language. Unfortunately, I have problems with accents like ά. I've tried a few ways.
One was simply to set the greek analyzer in the mapping, the other a lowercase analyzer with asciifolding. No success; with the greek analyzer I don't even get a result with the accent.
Below is what I did; it would be great if anyone could help me out here.
Mapping
PUT t1
{
"mappings": {
"profession" : {
"properties" : {
"text" : {
"type" : "keyword"
},
"suggest" : {
"type" : "completion",
"analyzer": "greek"
}
}
}
}
}
Dummy
POST t1/profession/?refresh
{
"suggest" : {
"input": [ "Μάγειρας"]
}
,"text": "Μάγειρας"
}
Query
GET t1/profession/_search
{ "suggest":
{ "profession" :
{ "prefix" : "Μα"
, "completion" :
{ "field" : "suggest"}
}}}
I found a way to do it with a custom analyzer, or via a plugin for ES, which I highly recommend when it comes to non-Latin texts.
Option 1
PUT t1
{ "settings":
{ "analysis":
{ "filter":
{ "greek_lowercase":
{ "type": "lowercase"
, "language": "greek"
}
}
, "analyzer":
{ "autocomplete":
{ "tokenizer": "lowercase"
, "filter":
[ "greek_lowercase" ]
}
}
}}
, "mappings": {
"profession" : {
"properties" : {
"text" : {
"type" : "keyword"
},
"suggest" : {
"type" : "completion",
"analyzer": "autocomplete"
}
}}}
}
Option 2 ICU Plugin
Install ES Plugin:
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
{ "settings": {
"index": {
"analysis": {
"normalizer": {
"latin": {
"filter": [
"custom_latin_transform"
]
}
},
"analyzer": {
"latin": {
"tokenizer": "keyword",
"filter": [
"custom_latin_transform"
]
}
},
"filter": {
"noDelimiter": {"type": "word_delimiter"},
"custom_latin_transform": {
"type": "icu_transform",
"id": "Greek-Latin/UNGEGN; Lower(); NFD; [:Nonspacing Mark:] Remove; NFC"
}
}
}
}
}
, "mappings":
{ "doc" : {
"properties" : {
"verbose" : {
"type" : "keyword"
},
"name" : {
"type" : "keyword"
},
"slugHash":{
"type" : "keyword",
"normalizer": "latin"
},
"level": { "type": "keyword" },
"hirarchy": {
"type" : "keyword"
},
"geopoint": { "type": "geo_point" },
"suggest" :
{ "type" : "completion"
, "analyzer": "latin"
, "contexts":
[ { "name": "level"
, "type": "category"
, "path": "level"
}
]
}}
}
}}
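To check what the transform actually emits, the _analyze API can be run against the index (a sketch, assuming the settings above were applied to an index named t1; the Greek-Latin/UNGEGN transform plus Lower()/NFD/mark-removal should yield a single unaccented Latin token such as mageiras):

```json
GET t1/_analyze
{
  "analyzer": "latin",
  "text": "Μάγειρας"
}
```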
I have an index with 3 different types of content: ['media', 'group', 'user'] and I need to do a search on all three at the same time, but requesting some extra parameters that one of them must satisfy before being added to the results list.
Here is my current index data:
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"media": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"UID": {
"type": "integer",
"include_in_all": false
},
"addtime": {
"type": "integer",
"include_in_all": false
},
"title": {
"type": "string",
"index": "not_analyzed"
}
}
},
"group": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"UID": {
"type": "integer",
"include_in_all": false
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"desc": {
"type": "string",
"include_in_all": false
}
}
},
"user": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"addtime": {
"type": "integer",
"include_in_all": false
},
"username": {
"type": "string"
}
}
}
}
}
So currently I can make a search on all the index with
{
query: {
match: {
_all: {
"query": "foo",
"operator": "and"
}
}
}
}
and get the results for media, groups or users with the word "foo" in them, which is great, but I need to remove from the results all the media of which the user is not the owner. So I guess I need a bool query where I set the "must" clause and add the UID variable set to whatever the current user ID is.
My problem is how to do this, and how to specify that the filter applies to just one type while leaving the others untouched.
I haven't been able to find an answer in the Elasticsearch documentation.
In the end I was able to accomplish this by following Andrei's comments. I know it is not perfect, since I had to add a should with the types "group" and "user", but it fits perfectly with my design since I need to put more filters on those too. Be advised that the search will end up being slower.
curl -X GET 'http://localhost:9200/foo/_search' -d '
{
"query": {
"bool" :
{
"must" :
{
"query" : {
"match" :
{
"_all":
{
"query" : "test"
}
}
}
},
"filter":
{
"bool":
{
"should":
[{
"bool" : {
"must":
[{
"type":
{
"value": "media"
}
},
{
"bool":
{
"should" : [
{ "term" : {"UID" : 2}},
{ "term" : {"type" : "public"}}
]
}
}]
}
},
{
"bool" : {
"should" : [
{ "type" : {"value" : "group"}},
{ "type" : {"value" : "user"}}
]
}
}]
}
}
}
}
}'