Elasticsearch cluster-level analyzer

How can I define one custom analyzer that will be used in more than one index (at the cluster level)? All the examples I can find show how to create a custom analyzer on a specific index.
My analyzer for example:
PUT try_index
{
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
},
"mappings": {
"record": {
"properties": {
"name": {
"type": "string",
"analyzer":"standard",
"search_analyzer": "od_analyzer"
}
}
}
}
}
Any idea how to change my analyzer scope to cluster level?
thanks

There is no "scope" for analyzers. But you can do something similar with index templates:
PUT /_template/some_name_here
{
"template": "a*",
"order": 0,
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
}
}
In "template", put the name pattern of the indices that this template should apply to; the template is applied when a matching index is created. You can also specify "*" to match all indices. I think that's the best you can do for what you want.
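As a rough illustration of how the pattern decides which indices pick up the analyzer, the sketch below builds the template body in Python and checks a few index names against the pattern. The helper names (build_template, template_applies) are made up for this example, and fnmatch only approximates Elasticsearch's wildcard matching. Note that Elasticsearch 6.0 and later expect "index_patterns" (a list) instead of "template"; which key you need depends on your version.

```python
from fnmatch import fnmatchcase

# Analysis settings shared by every index the template matches
# (same filter and analyzer as in the answer above).
OD_ANALYSIS = {
    "filter": {
        "od_synonyms": {
            "type": "synonym",
            "synonyms": [
                "dog, cat => animal",
                "john, lucas => boy",
                "emma, kate => girl",
            ],
        }
    },
    "analyzer": {
        "od_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "od_synonyms"],
        }
    },
}


def build_template(pattern, use_index_patterns=False):
    """Return a body for PUT /_template/<name> (hypothetical helper)."""
    body = {"order": 0, "settings": {"analysis": OD_ANALYSIS}}
    if use_index_patterns:
        # Assumption: Elasticsearch 6.0 or later.
        body["index_patterns"] = [pattern]
    else:
        body["template"] = pattern
    return body


def template_applies(pattern, index_name):
    """Approximate the wildcard check done when an index is created."""
    return fnmatchcase(index_name, pattern)


body = build_template("a*")
for name in ["articles", "books", "a"]:
    print(name, template_applies(body["template"], name))
```

Only indices created after the template exists are affected; existing indices keep their current settings.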

Related

Subwordsearch in elasticsearch using ngram does not work

I would like to perform a simple_query_string search in Elasticsearch with sub-word matching.
For example, if I had a filename: "C:\Users\Sven Onderbeke\Documents\Arduino"
then I would want this filename listed if my search term is, for example, "ocumen".
This thread suggested using ngram to match parts of a word. I tried to implement it as follows (in Python), but I get zero results while I expect one:
test_mapping = {
"properties": {
"filename": {
"type": "text",
"analyzer": "my_index_analyzer"
},
}
}
def create_index(index_name, mapping):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
        },
        "analysis": {
            "index_analyzer": {
                "my_index_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "mynGram"
                    ]
                }
            },
            "search_analyzer": {
                "my_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "lowercase",
                        "mynGram"
                    ]
                }
            },
            "filter": {
                "mynGram": {
                    "type": "nGram",
                    "min_gram": 2,
                    "max_gram": 50
                }
            }
        },
        "mappings": mapping
    }
    try:
        if not es.indices.exists(index_name):
            # Ignore 400 means to ignore "Index Already Exist" error.
            es.indices.create(index=index_name, ignore=400, body=settings)
            print(f'Created Index: {index_name}')
        created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created
create_index("test", test_mapping)
doc = {
'filename': r"C:\Users\Sven Onderbeke\Documents\Arduino",
}
es.index(index="test", document=doc)
needle = "ocumen"
q = {
"simple_query_string": {
"query": needle,
"default_operator": "and"
}
}
res = es.search(index="test", query=q)
print(res)
for hit in res['hits']['hits']:
    print(hit)
Your solution isn't working because you haven't set an analyzer on the field when defining the mapping. Update the mapping as below and then reindex all documents.
test_mapping = {
"properties": {
"filename": {
"type": "text",
"analyzer": "my_index_analyzer"
},
}
}
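Two further things in the question's code may be worth checking; these are my reading of the snippet, not something the answer above states. First, "analysis" sits next to "settings" rather than inside it, and the analyzers are declared under "index_analyzer" / "search_analyzer" instead of "analyzer". Second, ignore=400 hides any error from index creation, so a rejected body would go unnoticed and the index would later be auto-created with default mappings. The sketch below builds a body with the nesting I would expect, plus a small pure-Python ngrams helper (hypothetical, and only an approximation of the ngram token filter) showing why "ocumen" should match "Documents". The index.max_ngram_diff setting is included on the assumption of Elasticsearch 7 or later, where min_gram and max_gram may differ by at most 1 by default.

```python
def build_index_body(mapping, min_gram=2, max_gram=50):
    """Create-index body with analysis nested under settings."""
    return {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            # Assumption: ES 7+, which limits max_gram - min_gram to 1
            # unless this setting is raised.
            "index.max_ngram_diff": max_gram - min_gram,
            "analysis": {
                "filter": {
                    "mynGram": {
                        "type": "ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
                "analyzer": {
                    "my_index_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "mynGram"],
                    }
                },
            },
        },
        "mappings": mapping,
    }


def ngrams(token, min_gram, max_gram):
    """All lowercase substrings of token with length in [min_gram, max_gram]."""
    token = token.lower()
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


test_mapping = {
    "properties": {
        "filename": {"type": "text", "analyzer": "my_index_analyzer"}
    }
}
body = build_index_body(test_mapping)
print("ocumen" in ngrams("Documents", 2, 50))
```

Dropping ignore=400 while debugging should make any rejection of the body visible immediately.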

elasticsearch Updating Index settings analyzer

I have a Books index which contains multiple subjects:
chemistry
biology
etc
Each subject has its own set of synonyms, plus a set of global synonyms:
PUT /books/_settings
{
"analysis": {
"filter": {
"biology_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa"
]
},
"chemistry_synonyms": {
"type": "synonym",
"synonyms": [
"c, cc, ccc"
]
},
"global_synonyms": {
"type": "synonym",
"synonyms": [
"x, xx, xxx"
]
}
},
"analyzer": {
"chemistry_analyzer": {
"filter": [
"global_synonyms", "chemistry_synonyms"
]
},
"biology_analyzer": {
"filter": [
"global_synonyms", "biology_synonyms"
]
}
}
}
}
Let's say that at some point I want to add a new subject named "Astronomy".
The problem is: how do I update the index settings to add the new "Astronomy_synonyms" filter and "Astronomy_analyzer"?
My application requires me to append to the existing filters and analyzers; I don't want to overwrite (replace) the settings.
You can definitely append new token filters and analyzers; however, you need to close your index before updating the settings and reopen it when done. In what follows, I assume the index already exists.
Let's say you create your index with the following initial settings:
PUT /books
{
"settings": {
"analysis": {
"filter": {
"biology_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa"
]
},
"chemistry_synonyms": {
"type": "synonym",
"synonyms": [
"c, cc, ccc"
]
},
"global_synonyms": {
"type": "synonym",
"synonyms": [
"x, xx, xxx"
]
}
},
"analyzer": {
"chemistry_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"chemistry_synonyms"
]
},
"biology_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"biology_synonyms"
]
}
}
}
}
}
Then you need to close your index:
POST books/_close
Then you can append new analyzers and token filters:
PUT /books/_settings
{
"analysis": {
"filter": {
"astronomy_synonyms": {
"type": "synonym",
"synonyms": [
"x, xx, xxx"
]
}
},
"analyzer": {
"astronomy_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"astronomy_synonyms"
]
}
}
}
}
And finally reopen your index:
POST books/_open
If you then check your index settings, you'll see that everything has been properly merged.
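To picture what "properly merged" means here, the following sketch mimics the merge locally with plain dictionaries. It is only an illustration with a hypothetical merge_settings helper; Elasticsearch does the real merge server-side, and this code does not talk to a cluster.

```python
import copy


def merge_settings(existing, update):
    """Recursively add keys from update to a copy of existing."""
    merged = copy.deepcopy(existing)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A trimmed-down version of the initial settings from the answer.
existing = {
    "analysis": {
        "filter": {
            "biology_synonyms": {"type": "synonym", "synonyms": ["a, aa, aaa"]},
            "global_synonyms": {"type": "synonym", "synonyms": ["x, xx, xxx"]},
        },
        "analyzer": {
            "biology_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["global_synonyms", "biology_synonyms"],
            }
        },
    }
}

# The body sent to PUT /books/_settings while the index is closed.
update = {
    "analysis": {
        "filter": {
            "astronomy_synonyms": {"type": "synonym", "synonyms": ["x, xx, xxx"]}
        },
        "analyzer": {
            "astronomy_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["global_synonyms", "astronomy_synonyms"],
            }
        },
    }
}

merged = merge_settings(existing, update)
print(sorted(merged["analysis"]["filter"]))
print(sorted(merged["analysis"]["analyzer"]))
```

The existing filters and analyzers stay in place; only the new names are added alongside them.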
You can only define new analyzers on closed indices.
To add an analyzer, you must close the index, define the analyzer, and reopen the index.
POST /books/_close
PUT /books/_settings
{
"analysis": {
"filter": {
"astronomy_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa=>a"
]
}
},
"analyzer": {
"astronomy_analyzer": {
"tokenizer" : "whitespace",
"filter": [
"global_synonyms", "astronomy_synonyms"
]
}
}
}
}
POST /books/_open
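From a script, the same close / update / reopen sequence could be expressed as an ordered list of requests. The sketch below only builds that list; plan_analyzer_update is a hypothetical helper, and actually sending the requests (and reopening the index even if the update fails) is left to whichever HTTP client you use.

```python
def plan_analyzer_update(index, analysis):
    """Return (method, path, body) tuples in the order they must be sent."""
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", {"analysis": analysis}),
        ("POST", f"/{index}/_open", None),
    ]


# Same filter and analyzer as in the request above.
astronomy = {
    "filter": {
        "astronomy_synonyms": {"type": "synonym", "synonyms": ["a, aa, aaa=>a"]}
    },
    "analyzer": {
        "astronomy_analyzer": {
            "tokenizer": "whitespace",
            "filter": ["global_synonyms", "astronomy_synonyms"],
        }
    },
}

for method, path, body in plan_analyzer_update("books", astronomy):
    print(method, path)
```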

Elasticsearch index analyzers seem to do nothing after being added

New to ES and following the docs (https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html) on using different analyzers to deal with human language. After following some of the examples, it appears as though the added analyzers are having no effect on searches at all. E.g.:
## init some index for testing
PUT /testindex
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 3,
"analysis": {},
"refresh_interval": "1s"
},
"mappings": {
"testtype": {
"properties": {
"title": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
## adding some analyzers for...
POST /testindex/_close
##... simple lowercase tokenization, ...(https://www.elastic.co/guide/en/elasticsearch/guide/current/lowercase-token-filter.html#lowercase-token-filter)
PUT /testindex/_settings
{
"analysis": {
"analyzer": {
"my_lowercaser": {
"tokenizer": "standard",
"filter": [ "lowercase" ]
}
}
}
}
## ... normalization (https://www.elastic.co/guide/en/elasticsearch/guide/current/algorithmic-stemmers.html#_using_an_algorithmic_stemmer), ...
PUT testindex/_settings
{
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"light_english_stemmer": {
"type": "stemmer",
"language": "light_english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"light_english_stemmer",
"asciifolding"
]
}
}
}
}
## ... and using a hunspell dictionary (https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html#hunspell)
PUT testindex/_settings
{
"analysis": {
"filter": {
"en_US": {
"type": "hunspell",
"language": "en_US"
}
},
"analyzer": {
"en_US": {
"tokenizer": "standard",
"filter": [
"lowercase",
"en_US"
]
}
}
}
}
POST /testindex/_open
GET testindex/_settings
## it appears as though the analyzers have been added without problem
## adding some testing data
POST /testindex/testtype
{
"title": "Will the root word of movement be found?"
}
POST /testindex/testtype
{
"title": "That's why I never want to hear you say, ehhh I waant it thaaat away."
}
## expecting to match against root word of movement (move)
GET /testindex/testtype/_search
{
"query": {
"match": {
"title": "moving"
}
}
}
## which returns 0 hits, as shown below
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
## ... yet I can see that the record expected does in fact exist in the index when using...
GET /testindex/testtype/_search
{
"query": {
"match_all": {}
}
}
Thinking then that I need to actually "add" the analyzer to a (new) field, I do the following (which still shows negative results)
# adding the analyzers to a new field
POST /testindex/testtype
{
"mappings": {
"properties": {
"title2": {
"type": "text",
"analyzer": [
"my_lowercaser",
"english",
"en_US"
]
}
}
}
}
# looking at the tokens I'd expect to be able to find
GET /testindex/_analyze
{
"analyzer": "en_US",
"text": "Moving between directories"
}
# moving, move, between, directory
# what I actually see
GET /testindex/_analyze
{
"field": "title2",
"text": "Moving between directories"
}
# moving, between, directories
Even trying something simpler like
POST /testindex/testtype
{
"mappings": {
"properties": {
"title2": {
"type": "text",
"analyzer": "en_US"
}
}
}
}
does not help at all.
So this seems very messed up. Am I missing something here about how these analyzers are supposed to work? Should these analyzers be working properly (based on the provided info) and I am simply misusing them here? If so, could someone please provide an example query that would actually work/hit?
** Is there other debugging information that should be added here?
The title2 field has 3 analyzers, but according to your output (the _analyze endpoint) it seems that only my_lowercaser is applied.
Finally, the config that worked for me with hunspell is:
"settings": {
"analysis": {
"filter": {
"en_US": {
"type": "hunspell",
"language": "en_US"
}
},
"analyzer": {
"en_US": {
"tokenizer": "standard",
"filter": [ "lowercase", "en_US" ]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title-en-us": {
"type": "text",
"analyzer": "en_US"
}
}
}
}
movement is not resolved to move, while moving is (probably related to the hunspell dictionary). Querying with move returned only the docs containing moving, not those containing movement.
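One detail in the question that the answer does not spell out, so treat this as my own reading: POST /testindex/testtype with a "mappings" body indexes a new document rather than changing the mapping, and "analyzer" takes a single analyzer name, not a list. A sketch of building a put-mapping request instead is below. The helper name is hypothetical, and the /_mapping/<type> path assumes a version with mapping types as in the question (6.x); on 7.x and later the type segment is dropped.

```python
def put_mapping_request(index, doc_type, field, analyzer):
    """Build (method, path, body) for adding a text field with one analyzer."""
    if not isinstance(analyzer, str):
        raise TypeError("a field takes exactly one analyzer name")
    body = {"properties": {field: {"type": "text", "analyzer": analyzer}}}
    return ("PUT", f"/{index}/_mapping/{doc_type}", body)


method, path, body = put_mapping_request("testindex", "testtype", "title2", "en_US")
print(method, path)
print(body)
```

Documents indexed before the new field was mapped would still need to be reindexed for the field to contain anything.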

Elastic Search - how to use language analyzer with UTF-8 filter?

I have a problem with the Elasticsearch language analyzer. I am working with the Lithuanian language, so I am using the Lithuanian language analyzer. The analyzer works fine and I get all the word cases I need. For example, I index the Lithuanian city "Klaipėda":
PUT /cities/city/1
{
"name": "Klaipėda"
}
The problem is that I also need to get a result when I search for "Klaipėda" written only in the Latin alphabet ("Klaipeda"), in all Lithuanian cases:
Nominative case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" work, but "Klaipeda", "Klaipedos", "Klaipedoje" do not.
My index:
PUT /cities
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"fields": {
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"md_folded_analyzer": {
"type": "lithuanian",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
and search query:
GET /cities/_search
{
"query": {
"multi_match" : {
"type": "most_fields",
"query": "klaipeda",
"fields": [ "name", "name.folded" ]
}
}
}
What am I doing wrong? Thanks for the help.
The technique you are using here is called multi-fields. The limitation of the underlying name.folded field is that you can't search against it; you can only sort and aggregate on name.folded.
To work around this, I came up with the following set-up:
Separate fields (to avoid duplicating values, just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
"mappings": {
"city": {
"properties": {
"name": {
"type": "string",
"analyzer": "lithuanian",
"copy_to": "folded"
},
"folded": {
"type": "string",
"analyzer": "md_folded_analyzer"
}
}
}
}
}'
Change the type of your analyzer to custom, as described here, because otherwise asciifolding does not make it into the analyzer: with "type": "lithuanian", the tokenizer and filter settings are ignored. More importantly, asciifolding should come after the Lithuanian stop-word and stemming filters, because once a word is folded it may no longer be recognized by them.
curl -XPUT http://localhost:9200/my_cities -d '
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"md_folded_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_stemmer",
"asciifolding"
]
}
}
}
}
}'
Sorry, I've left out lithuanian_keywords: it requires additional set-up, which I skipped here. But I hope you get the idea.
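To see what the folding step does to the example city name, here is a rough stand-in for asciifolding using Python's unicodedata module. It is an approximation for illustration only (the real filter is implemented in Lucene and covers more characters), and fold is a name chosen for this sketch.

```python
import unicodedata


def fold(text):
    """Strip combining marks after canonical decomposition (approximate)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["Klaipėda", "Klaipėdos", "Klaipėdoje"]:
    print(word, "->", fold(word))

# Filter order from the answer: stop words and stemming see the original
# Lithuanian letters, and folding happens last.
md_folded_filters = [
    "lowercase",
    "lithuanian_stop",
    "lithuanian_stemmer",
    "asciifolding",
]
```

Because the same analyzer runs at search time, a query typed as "Klaipeda" and one typed as "Klaipėda" go through the same filter chain on the folded field.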

Custom predefined stopword list in Elasticsearch

How can I define a custom stopword list globally, so that it is accessible from all indices?
Ideally, I could use this stopword list just like the predefined language-specific stopword lists:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": "_my_predefined_stopword_list_"
}
}
}
}
}
The official Elasticsearch documentation describes how to create a custom filter with a list of stopwords. You can find the description here:
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": [ "si", "esta", "el", "la" ]
},
"light_spanish": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"my_spanish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"spanish_stop",
"light_spanish"
]
}
}
}
}
}
After defining this spanish_stop filter, you can use it in the definitions of your indices.
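Since the filter is still declared per index, one way to keep a single list on the client side is to define it once and reuse it in every index body, as sketched below (hypothetical names throughout). The stop filter also accepts a stopwords_path pointing to a file under the Elasticsearch config directory, which may suit a cluster-wide list better; that route is not shown here.

```python
# One shared list, defined in a single place in the application.
MY_STOPWORDS = ["si", "esta", "el", "la"]


def index_body_with_stopwords(stopwords):
    """Create-index body whose stop filter uses the shared list."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_stop": {"type": "stop", "stopwords": list(stopwords)}
                },
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_stop"],
                    }
                },
            }
        }
    }


# Bodies for PUT /index_a and PUT /index_b (index names are placeholders).
bodies = {
    name: index_body_with_stopwords(MY_STOPWORDS)
    for name in ["index_a", "index_b"]
}
print(bodies["index_a"]["settings"]["analysis"]["filter"]["my_stop"])
```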

Resources