Custom char filter of Elasticsearch is not working - elasticsearch

I have the custom analyzer shown below
{
"settings": {
"analysis": {
"analyzer": {
"query_logging_analyzer": {
"tokenizer": "whitespace",
"char_filter": [
{
"type": "mapping",
"mappings": [
"'=>\\u0020",
"{=>\\u0020",
"}=>\\u0020",
",=>\\u0020",
":=>\\u0020",
"[=>\\u0020",
"]=>\\u0020"
]
}
]
}
}
}
}
}
Basically, I want an analyzer that takes in a JSON string and emits the keys of that JSON. For that purpose, I need to tidy up the JSON by removing punctuation such as [ ] { } , : "
This somehow is not working; the char_filter isn't taking effect at all.
I had a similar char_filter using pattern_replace, which didn't work either:
"char_filter": [
{
"type": "pattern_replace",
"pattern": "\"|\\,|\\:|\\'|\\{|\\}",
"replacement": ""
}
]

What exactly is 'not working'?
Because your custom analyzer seems to work just fine:
GET _analyze?filter_path=tokens.token
{
"tokenizer": "whitespace",
"text": [
"""{ "names": [ "Foo", "Bar" ] }"""
],
"char_filter": [
{
"type": "mapping",
"mappings": [
"'=>\\u0020",
"{=>\\u0020",
"}=>\\u0020",
",=>\\u0020",
":=>\\u0020",
"[=>\\u0020",
"]=>\\u0020"
]
}
]
}
Returns:
{
"tokens": [
{
"token": """"names""""
},
{
"token": """"Foo""""
},
{
"token": """"Bar""""
}
]
}
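If the analyzer is registered in your index settings but still seems to have no effect, make sure you test it by name against that index rather than with an ad-hoc definition. A minimal sketch, where my_index is a placeholder for your actual index name:
# my_index is a placeholder for your actual index name
GET my_index/_analyze
{
"analyzer": "query_logging_analyzer",
"text": ["""{ "names": [ "Foo", "Bar" ] }"""]
}
Also note that the mappings above leave the double quote character untouched, which is why the tokens in the output still carry quotes; if you want bare keys, an extra rule along the lines of "\"=>\\u0020" would be needed as well.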

Related

Elasticsearch: updating index settings analyzer

I have a Books index which contains multiple subjects:
chemistry
biology
etc
Each subject has its own set of synonyms, plus global synonyms:
PUT /books/_settings
{
"analysis": {
"filter": {
"biology_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa"
]
},
"chemistry_synonyms": {
"type": "synonym",
"synonyms": [
"c, cc, ccc"
]
},
"global_synonyms": {
"type": "synonym",
"synonym": [
"x, xx, xxx"
]
}
},
"analyzer": {
"chemistry_analyzer": {
"filter": [
"global_synonyms", "chemistry_synonyms"
]
},
"biology_analyzer": {
"filter": [
"global_synonyms", "biology_synonyms"
]
}
}
}
}
Let's say at any point in time I want to add a new subject named "Astronomy".
Now the problem is how do I update the index settings to add a new "Astronomy_synonyms" filter and an "Astronomy_analyzer"?
My application requires me to append to the existing filters and analyzers; I don't want to overwrite (replace) the settings.
You can definitely append new token filters and analyzers; however, you need to close your index before updating the settings and reopen it when done. In what follows, I assume the index already exists.
Let's say you create your index with the following initial settings:
PUT /books
{
"settings": {
"analysis": {
"filter": {
"biology_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa"
]
},
"chemistry_synonyms": {
"type": "synonym",
"synonyms": [
"c, cc, ccc"
]
},
"global_synonyms": {
"type": "synonym",
"synonyms": [
"x, xx, xxx"
]
}
},
"analyzer": {
"chemistry_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"chemistry_synonyms"
]
},
"biology_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"biology_synonyms"
]
}
}
}
}
}
Then you need to close your index:
POST books/_close
Then you can append new analyzers and token filters:
PUT /books/_settings
{
"analysis": {
"filter": {
"astronomy_synonyms": {
"type": "synonym",
"synonyms": [
"x, xx, xxx"
]
}
},
"analyzer": {
"astronomy_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"global_synonyms",
"astronomy_synonyms"
]
}
}
}
}
And finally reopen your index
POST books/_open
If you then check your index settings, you'll see that everything has been properly merged.
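For instance, a quick way to confirm the merge (the filter_path parameter just trims the response down to the analysis section):
GET /books/_settings?filter_path=*.settings.index.analysis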
You can only define new analyzers on closed indices.
To add an analyzer, you must close the index, define the analyzer, and reopen the index.
POST /books/_close
PUT /books/_settings
{
"analysis": {
"filter": {
"astronomy_synonyms": {
"type": "synonym",
"synonyms": [
"a, aa, aaa=>a"
]
}
},
"analyzer": {
"astronomy_analyzer": {
"tokenizer" : "whitespace",
"filter": [
"global_synonyms", "astronomy_synonyms"
]
}
}
}
}
POST /books/_open
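As a quick sanity check (a sketch), you can then run the new analyzer through _analyze; with the contraction rule above, a, aa and aaa should all collapse to the same token:
GET /books/_analyze
{
"analyzer": "astronomy_analyzer",
"text": "aaa"
}
This should emit the single token a.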

Elasticsearch - Special Characters in Query String

I'm having trouble trying to search special characters using query_string. I need to search an email address in the format "xxx#xxx.xxx". At index time I use a custom normalizer which provides lowercasing and ASCII folding. At search time I use a custom analyzer with a whitespace tokenizer and filters that apply lowercasing and ASCII folding. Even so, I am not able to search for a simple email address.
This is my mapping:
{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"normalizer": {
"lowerasciinormalizer": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"email": {
"type": "keyword",
"normalizer": "lowerasciinormalizer"
}
}
}
}
And this is my search query
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "pippo#pluto.it",
"fields": [
"email"
],
"analyzer": "folding"
}
}
]
}
}
}
Searching without special characters works fine. In fact, if I do "query": "pippo*" I get the correct results.
I also tested the tokenizer by running:
GET /_analyze
{
"analyzer": "whitespace",
"text": "pippo#pluto.com"
}
I get what I expect
{
"tokens" : [
{
"token" : "pippo#pluto.com",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
}
]
}
Any suggestions?
Thanks.
Edit:
I'm using elasticsearch 7.5.1
This works correctly. My problem was somewhere else.
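For anyone debugging a similar setup, one way to rule out the analysis chain is to run the same value through both the index-time normalizer and the search-time analyzer (a sketch, where my_index stands in for your actual index name):
# my_index is a placeholder for your actual index name
GET my_index/_analyze
{
"normalizer": "lowerasciinormalizer",
"text": "PIPPO#Pluto.it"
}

GET my_index/_analyze
{
"analyzer": "folding",
"text": "PIPPO#Pluto.it"
}
Both should produce pippo#pluto.it, confirming that the indexed keyword and the query term line up.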

How to remove spaces in between words before indexing

How do I remove spaces between words before indexing?
For example:
I want to be able to search for 0123 7784 9809 7893
when I query "0123 7784 9809 7893", "0123778498097893", or "0123-7784-9809-7893"
My idea is to remove all spaces and dashes and combine the parts into a whole string (0123 7784 9809 7893 to 0123778498097893) before indexing, and also to add an analyzer on the query side so as to find my desired result.
I have tried
"char_filter" : {
"neglect_dash_and_space_filter" : {
"type" : "mapping",
"mappings" : [
"- => ",
"' ' => "
]
}
It seems that only the dash is removed, but not the spaces. I also tested a custom shingle filter, but it's still not working. Kindly advise. Thanks.
You can use a pattern_replace token filter that replaces anything other than digits with an empty string:
{
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": "[^0-9]", ---> it will replace anything other than digits
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"whitespace_remove"
]
}
}
}
}
}
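For example (a sketch, where my_index stands in for an index created with the settings above), the different input variants should collapse to the same token:
# my_index is a placeholder for an index created with the settings above
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "0123-7784-9809-7893"
}
The keyword tokenizer emits the whole input as one token and the pattern_replace filter strips every non-digit, so this returns the single token 0123778498097893; "0123 7784 9809 7893" goes through the same way.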
You can use the \uXXXX notation for spaces in a mapping char filter:
PUT index41
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"\\u0020 => ",
"- => "
]
}
}
}
}
}
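A quick check (a sketch) that the char filter strips both spaces and dashes before tokenization:
GET index41/_analyze
{
"analyzer": "my_analyzer",
"text": "0123 7784-9809 7893"
}
Because the mapping char filter runs before the standard tokenizer, the text reaches the tokenizer as 0123778498097893 and comes out as a single token.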

Elasticsearch cluster level analyzer

How can I define one custom analyzer that will be used in more than one index (at the cluster level)? All the examples I can find show how to create a custom analyzer on a specific index.
My analyzer for example:
PUT try_index
{
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
},
"mappings": {
"record": {
"properties": {
"name": {
"type": "string",
"analyzer":"standard",
"search_analyzer": "od_analyzer"
}
}
}
}
}
Any idea how to change my analyzer's scope to the cluster level?
Thanks
There is no "scope" for analyzers. But you can do something similar with index templates:
PUT /_template/some_name_here
{
"template": "a*",
"order": 0,
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
}
}
And at "template" you should put the name of the indices that this template should be applied to when the index is created. You could very well specify "*" and matching all the indices. I think that's the best you can do for what you want.

Multiple like query in Elasticsearch

I have a field path in my Elasticsearch documents which has entries like this:
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_011007/stderr
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_008874/stderr
Note: I want to select all the documents having the below line in the path field:
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr
I want to make a like query on this path field given certain things (basically an AND condition on all 3):
I have given application number 1451299305289_0120
I have also given a task number 009257
The path field should also contain stderr
Given the above criteria, the document whose path field is the 3rd line should be selected.
This is what I have tried so far:
http://localhost:9200/logstash-*/_search?q=application_1451299305289_0120 AND path:stderr&size=50
This query fulfills the 3rd criterion, and partially the 1st, i.e. if I search for 1451299305289_0120 instead of application_1451299305289_0120, I get 0 results. (What I really need is a like search on 1451299305289_0120.)
When I tried this
http://10.30.145.160:9200/logstash-*/_search?q=path:*_1451299305289_0120*008779 AND path:stderr&size=50
I got the result, but using * at the start is a costly operation. Is there another way to achieve this efficiently (like using nGram or the fuzzy search of Elasticsearch)?
This can be achieved by using the pattern_replace char filter. You just extract only the important bits of information with regex. This is my setup:
POST log_index
{
"settings": {
"analysis": {
"analyzer": {
"app_analyzer": {
"char_filter": [
"app_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"path_analyzer": {
"char_filter": [
"path_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"task_analyzer": {
"char_filter": [
"task_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"app_extractor": {
"type": "pattern_replace",
"pattern": ".*application_(.*)/container.*",
"replacement": "$1"
},
"path_extractor": {
"type": "pattern_replace",
"pattern": ".*/(.*)",
"replacement": "$1"
},
"task_extractor": {
"type": "pattern_replace",
"pattern": ".*container.{27}(.*)/.*",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "keyword",
"fields": {
"application_number": {
"type": "string",
"analyzer": "app_analyzer"
},
"path": {
"type": "string",
"analyzer": "path_analyzer"
},
"task": {
"type": "string",
"analyzer": "task_analyzer"
}
}
}
}
}
}
}
I am extracting the application number, task number and path with regex. You might want to tweak the task regex a bit if you have some other log pattern. Then we can use filters to search. A big advantage of using filters is that they are cached and make subsequent calls faster.
I indexed a sample log like this:
PUT log_index/your_type/1
{
"name" : "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}
This query will give you the desired results:
GET log_index/_search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"name.application_number": "1451299305289_0120"
}
},
{
"term": {
"name.task": "009257"
}
},
{
"term": {
"name.path": "stderr"
}
}
]
}
}
}
}
}
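You can also sanity-check each extractor with _analyze to make sure the char filters capture what you expect (a sketch):
GET log_index/_analyze
{
"analyzer": "task_analyzer",
"text": "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}
This should return the single token 009257; swapping in app_analyzer or path_analyzer should return 1451299305289_0120 and stderr respectively.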
On a side note, the filtered query is deprecated in ES 2.x; just use filter directly. Also, the path_hierarchy tokenizer might be useful for some other use cases.
Hope this helps :)
