Updating index settings analyzer - elasticsearch

I have a Books index which contains multiple subjects:
chemistry
biology
etc.
Each subject has its own set of synonyms, plus a set of global synonyms.
PUT /books/_settings
{
  "analysis": {
    "filter": {
      "biology_synonyms": {
        "type": "synonym",
        "synonyms": [
          "a, aa, aaa"
        ]
      },
      "chemistry_synonyms": {
        "type": "synonym",
        "synonyms": [
          "c, cc, ccc"
        ]
      },
      "global_synonyms": {
        "type": "synonym",
        "synonyms": [
          "x, xx, xxx"
        ]
      }
    },
    "analyzer": {
      "chemistry_analyzer": {
        "filter": [
          "global_synonyms", "chemistry_synonyms"
        ]
      },
      "biology_analyzer": {
        "filter": [
          "global_synonyms", "biology_synonyms"
        ]
      }
    }
  }
}
Let's say that at some point I want to add a new subject named "Astronomy".
Now the problem is: how do I update the index settings to add a new "Astronomy_synonyms" filter and an "Astronomy_analyzer"?
My application requires me to append to the existing filters and analyzers; I don't want to overwrite (replace) the settings.

You can definitely append new token filters and analyzers; however, you need to close your index before updating the settings and reopen it when done. In what follows, I assume the index already exists.
Let's say you create your index with the following initial settings:
PUT /books
{
  "settings": {
    "analysis": {
      "filter": {
        "biology_synonyms": {
          "type": "synonym",
          "synonyms": [
            "a, aa, aaa"
          ]
        },
        "chemistry_synonyms": {
          "type": "synonym",
          "synonyms": [
            "c, cc, ccc"
          ]
        },
        "global_synonyms": {
          "type": "synonym",
          "synonyms": [
            "x, xx, xxx"
          ]
        }
      },
      "analyzer": {
        "chemistry_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "global_synonyms",
            "chemistry_synonyms"
          ]
        },
        "biology_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "global_synonyms",
            "biology_synonyms"
          ]
        }
      }
    }
  }
}
Then you need to close your index:
POST books/_close
Then you can append new analyzers and token filters:
PUT /books/_settings
{
  "analysis": {
    "filter": {
      "astronomy_synonyms": {
        "type": "synonym",
        "synonyms": [
          "x, xx, xxx"
        ]
      }
    },
    "analyzer": {
      "astronomy_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "global_synonyms",
          "astronomy_synonyms"
        ]
      }
    }
  }
}
And finally, reopen your index:
POST books/_open
If you then check your index settings, you'll see that everything has been properly merged.
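For example, a quick check along those lines (a sketch, assuming the requests above were applied to the books index) could look like this:
// verify that the old and new analysis settings were merged
GET /books/_settings

// smoke test of the newly added analyzer
POST /books/_analyze
{
  "analyzer": "astronomy_analyzer",
  "text": "xxx"
}
The _settings response should list all four synonym filters and all three analyzers, and the _analyze call should expand "xxx" into its configured synonyms.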

You can only define new analyzers on closed indices.
To add an analyzer, you must close the index, define the analyzer, and reopen the index.
POST /books/_close

PUT /books/_settings
{
  "analysis": {
    "filter": {
      "astronomy_synonyms": {
        "type": "synonym",
        "synonyms": [
          "a, aa, aaa=>a"
        ]
      }
    },
    "analyzer": {
      "astronomy_analyzer": {
        "tokenizer": "whitespace",
        "filter": [
          "global_synonyms", "astronomy_synonyms"
        ]
      }
    }
  }
}
POST /books/_open
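As a quick check (a sketch, assuming the settings above were accepted), _analyze should show the explicit rule "a, aa, aaa=>a" collapsing all three variants into a single token:
POST /books/_analyze
{
  "analyzer": "astronomy_analyzer",
  "text": "aaa"
}
// expected output: the single token "a"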

Related

Subwordsearch in elasticsearch using ngram does not work

I would like to perform a simple_query_string search in Elasticsearch with sub-word matching.
For example, if I had a filename "C:\Users\Sven Onderbeke\Documents\Arduino",
then I would want this filename listed if my search term is, for example, "ocumen".
This thread suggested using ngram to match parts of a word. I tried to implement it as follows (in Python), but I get zero results while I expect one:
from elasticsearch import Elasticsearch

# client setup assumed here; the original snippet uses an existing `es` client
es = Elasticsearch("http://localhost:9200")

test_mapping = {
    "properties": {
        "filename": {
            "type": "text",
            "analyzer": "my_index_analyzer"
        },
    }
}

def create_index(index_name, mapping):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
        },
        "analysis": {
            "index_analyzer": {
                "my_index_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "mynGram"
                    ]
                }
            },
            "search_analyzer": {
                "my_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "lowercase",
                        "mynGram"
                    ]
                }
            },
            "filter": {
                "mynGram": {
                    "type": "nGram",
                    "min_gram": 2,
                    "max_gram": 50
                }
            }
        },
        "mappings": mapping
    }
    try:
        if not es.indices.exists(index_name):
            # Ignore 400 means to ignore "Index Already Exist" error.
            es.indices.create(index=index_name, ignore=400, body=settings)
            print(f'Created Index: {index_name}')
            created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created

create_index("test", test_mapping)

doc = {
    'filename': r"C:\Users\Sven Onderbeke\Documents\Arduino",
}
es.index(index="test", document=doc)

needle = "ocumen"
q = {
    "simple_query_string": {
        "query": needle,
        "default_operator": "and"
    }
}
res = es.search(index="test", query=q)
print(res)
for hit in res['hits']['hits']:
    print(hit)
The reason your solution isn't working is that you haven't set the analyzer on the filename field when defining the mapping. Update the mapping as below and then reindex all documents.
test_mapping = {
    "properties": {
        "filename": {
            "type": "text",
            "analyzer": "my_index_analyzer"
        },
    }
}
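One way to confirm that sub-word tokens are actually produced is to run _analyze against the mapped field (a sketch in console syntax, assuming the test index above exists and my_index_analyzer resolved correctly):
POST /test/_analyze
{
  "field": "filename",
  "text": "Documents"
}
// with the 2-50 ngram filter in place, the output should include the token "ocumen"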

Preventing tokenisation of certain alpha characters in ElasticSearch

I would like to prevent - and / from being tokenised or stemmed for a particular field.
I thought I had some code to achieve such behavior:
"char_filters": {
"type": "word_delimiter",
"type_table": [
"- => ALPHA",
"/ => ALPHA"
]
},
However, it errors:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [char_filters] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [char_filters] cannot be used to parse synonyms"
  },
  "status": 400
}
Looking online, I've found PatternReplaceFilterFactory and a few other methods; however, these substitute the characters. I want the analyzer to treat the two characters as part of the token.
So I would like the string 5/3mm to be tokenised as-is, not split into 5 and 3mm.
Could someone please advise the correct way to achieve this? Here's a simplified PUT and some POST _analyze requests.
// doc 1 contains what I would like to match
POST /products_example/_doc/1
{
  "ProductDescription_stripped": "RipCurl 5/3mm wetsuit omega",
  "ProductDescription_da_stripped": "RipCurl 5/3mm wetsuit omega"
}

// doc 2 contains only 3mm. Should be prioritised below 5/3mm (match 1)
POST /products_example/_doc/2
{
  "ProductDescription_stripped": "RipCurl 3mm wetsuit omega",
  "ProductDescription_da_stripped": "RipCurl 5/3mm wetsuit omega"
}

// here you can see 3mm has been tokenised, whereas 5/3mm should have been preserved
POST /products_example/_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding", "synonym", "stop", "kstem" ],
  "text": "5/3mm ripcurl wetsuit omega"
}
PUT /products/
{
  "settings": {
    "index.mapping.total_fields.limit": 1000000,
    "index.max_ngram_diff": 2,
    "analysis": {
      "filter": {
        "char_filters": {
          "type": "word_delimiter",
          "type_table": [
            "- => ALPHA",
            "/ => ALPHA"
          ]
        },
        "description_stemmer_da": { "type": "stemmer", "name": "danish" },
        "stop_da": { "type": "stop", "stopwords": "_danish_" },
        "synonym": {
          "type": "synonym",
          "synonyms": [ "ripcurl, ripccurl => rip curl" ]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram", "min_gram": 3, "max_gram": 5,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "description": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "char_filters",
            "lowercase",
            "asciifolding",
            "synonym",
            "stop",
            "kstem"
          ]
        },
        "description_da": {
          "type": "custom", "tokenizer": "standard",
          "filter": [
            "char_filters",
            "lowercase",
            "asciifolding",
            "synonym",
            "stop_da",
            "description_stemmer_da"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "ProductDescription_stripped": {
        "type": "text",
        "analyzer": "description"
      },
      "ProductDescription_da_stripped": {
        "type": "text",
        "analyzer": "danish"
      }
    }
  }
}

custom char filter of elastic search is not working

I have the custom analyzer shown below
{
  "settings": {
    "analysis": {
      "analyzer": {
        "query_logging_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": [
            {
              "type": "mapping",
              "mappings": [
                "'=>\\u0020",
                "{=>\\u0020",
                "}=>\\u0020",
                ",=>\\u0020",
                ":=>\\u0020",
                "[=>\\u0020",
                "]=>\\u0020"
              ]
            }
          ]
        }
      }
    }
  }
}
Basically, I want an analyzer that takes in a JSON string and emits the keys of that JSON. For that purpose, I need to tidy up the JSON by removing JSON punctuation such as [ ] { } , : "
This somehow is not working; the char_filter isn't taking effect at all.
I had a similar char_filter using pattern_replace, which didn't work either:
"char_filter": [
{
"type": "pattern_replace",
"pattern": "\"|\\,|\\:|\\'|\\{|\\}",
"replacement": ""
}
]
What exactly is 'not working'?
Because your custom analyzer seems to work just fine:
GET _analyze?filter_path=tokens.token
{
  "tokenizer": "whitespace",
  "text": [
    """{ "names": [ "Foo", "Bar" ] }"""
  ],
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "'=>\\u0020",
        "{=>\\u0020",
        "}=>\\u0020",
        ",=>\\u0020",
        ":=>\\u0020",
        "[=>\\u0020",
        "]=>\\u0020"
      ]
    }
  ]
}
Returns:
{
  "tokens": [
    {
      "token": """"names""""
    },
    {
      "token": """"Foo""""
    },
    {
      "token": """"Bar""""
    }
  ]
}
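The same check can be run through the analyzer registered in the index settings (a sketch, assuming an index, hypothetically named query_logs here, was created with the settings from the question):
GET query_logs/_analyze
{
  "analyzer": "query_logging_analyzer",
  "text": """{ "names": [ "Foo", "Bar" ] }"""
}
// should return the same three tokens, showing the char_filter is applied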

Custom predefined stopword list in Elasticsearch

How can I define a custom stopword list globally, in a way that is accessible from all indexes?
It would be ideal to use this stopword list just like the predefined language-specific stopword lists:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_my_predefined_stopword_list_"
        }
      }
    }
  }
}
The official Elasticsearch documentation describes how to create a custom filter with a list of stopwords. You can find the description here:
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "el", "la" ]
        },
        "light_spanish": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "spanish",
          "filter": [
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}
After defining the spanish_stop filter, you can use it in the definition of your indices.
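For instance, a field mapping that refers to the analyzer defined above might look like this (a sketch for a typeless mapping, i.e. Elasticsearch 7+; the field name titulo is just an illustration):
PUT /my_index/_mapping
{
  "properties": {
    "titulo": {
      "type": "text",
      "analyzer": "my_spanish"
    }
  }
}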

Elastic search cluster level analyzer

How can I define one custom analyzer that will be used in more than one index (at the cluster level)? All the examples I can find show how to create a custom analyzer on a specific index.
My analyzer, for example:
PUT try_index
{
  "settings": {
    "analysis": {
      "filter": {
        "od_synonyms": {
          "type": "synonym",
          "synonyms": [
            "dog, cat => animal",
            "john, lucas => boy",
            "emma, kate => girl"
          ]
        }
      },
      "analyzer": {
        "od_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "od_synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "record": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard",
          "search_analyzer": "od_analyzer"
        }
      }
    }
  }
}
Any idea how to change my analyzer's scope to the cluster level?
Thanks
There is no "scope" for analyzers. But you can do something similar with index templates:
PUT /_template/some_name_here
{
  "template": "a*",
  "order": 0,
  "settings": {
    "analysis": {
      "filter": {
        "od_synonyms": {
          "type": "synonym",
          "synonyms": [
            "dog, cat => animal",
            "john, lucas => boy",
            "emma, kate => girl"
          ]
        }
      },
      "analyzer": {
        "od_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "od_synonyms"
          ]
        }
      }
    }
  }
}
And at "template" you should put the name of the indices that this template should be applied to when the index is created. You could very well specify "*" and matching all the indices. I think that's the best you can do for what you want.
