I have been using synonyms in Elasticsearch to map data. I created an index with settings like this:
PUT uk-2016.06.22
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"british,uk,england,britain"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Rather than manually creating a synonym text file for the mapping, is there a synonym library available for download that I can use with Elasticsearch? I have had trouble finding one.
Related
I have more than 4000 keywords that I want to index in Elasticsearch.
I want to pass in a text and extract the keywords that occur in it.
The first problem is that it works when I pass a small number of keywords, but when I pass many keywords it extracts words that are not in the text.
The second problem is that it only extracts whole words delimited by spaces.
I also want to extract a keyword when it appears inside a longer word.
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
},
"my_analyzer_shingle": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase",
"shingle"
]
}
}
}
}
}
POST /test/your_type/
{
"keyword": "search"
}
POST /test/your_type/_search
{
"query": {
"match": {
"keyword": "elasticsearch"
}
}
}
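A rough illustration of why the second problem occurs, assuming the default behaviour of the shingle filter (two-word shingles emitted alongside the original single tokens). The function name `shingles` and the whitespace split standing in for the standard tokenizer are simplifications made up for this sketch, not Elasticsearch code.

```python
def shingles(tokens, size=2, output_unigrams=True):
    """Approximate the terms a shingle filter emits for a list of tokens."""
    out = list(tokens) if output_unigrams else []
    # Join every run of `size` adjacent tokens with a single space.
    for i in range(len(tokens) - size + 1):
        out.append(" ".join(tokens[i:i + size]))
    return out


# Stand-in for tokenizer + lowercase filter (hypothetical simplification).
tokens = "I use Elasticsearch daily".lower().split()
terms = shingles(tokens)
print(terms)
print("search" in terms)  # checks for a whole-term match only
```

Shingles only join whole tokens, so the text `elasticsearch` never produces the term `search`. Matching inside a word would likely need an n-gram based analyzer instead.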
I have time-based indices
students-2018
students-2019
students-2020
I have defined one analyzer with synonyms and want to reuse the same analyzer across multiple indices. How do I achieve that?
You can define an index template that contains your custom analyzer and whose index pattern covers all your student indices.
Add your index pattern to the index template call below, as described in the official documentation.
Sample index template definition:
{
"index_patterns": ["student-*"],
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
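Now every index whose name matches the pattern, such as student-2018 or student-2019, will have the my_custom_analyzer defined in the index template. The pattern has to match your real index names, so for indices named students-2018, students-2019 and so on, use "students-*" instead.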
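A template is only applied to indices whose names match its pattern, so it is worth checking the pattern against the real index names before relying on it. Below is a small sketch of that check. It uses Python's `fnmatch` as a stand-in for Elasticsearch's wildcard matching, which is an approximation (`fnmatch` also understands `?` and `[...]`), and `matching_indices` is a hypothetical helper, not part of any Elasticsearch client.

```python
from fnmatch import fnmatchcase


def matching_indices(pattern, index_names):
    """Return the index names that a '*' wildcard pattern would cover."""
    return [name for name in index_names if fnmatchcase(name, pattern)]


indices = ["students-2018", "students-2019", "students-2020", "student-2018"]
print(matching_indices("student-*", indices))   # only the singular name matches
print(matching_indices("students-*", indices))  # the three plural names match
```

Note that "student-*" does not cover names that start with "students-", because the character after "student" has to be a hyphen.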
Create a student index without any settings or analyzer, for example with a PUT request to:
http://{{you-es-hostname}}/student-2018
Then check its settings using GET http://{{you-es-hostname}}/student-2018, which gives the output below, including the analyzer created by the index template.
{
"student-2018": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "student-2018",
"creation_date": "1588653678067",
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": [
"html_strip"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "kjGEgKCOSJeIlrASP-RaMQ",
"version": {
"created": "7040299"
}
}
}
}
}
How can I define a custom stopword list globally, so that it is accessible from all indices?
Ideally I would use this stopword list the same way we use the predefined language-specific stopword lists:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": "_my_predefined_stopword_list_"
}
}
}
}
}
The official Elasticsearch documentation describes how to create a custom filter with a list of stopwords. You can find the description here:
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": [ "si", "esta", "el", "la" ]
},
"light_spanish": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"my_spanish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"spanish_stop",
"light_spanish"
]
}
}
}
}
}
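After defining the spanish_stop filter, you can reference it from any analyzer in that index. Analysis settings are per index, so to make the filter available in other indices you have to repeat the definition when creating them.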
How can I define one custom analyzer that will be used in more than one index (at the cluster level)? All the examples I can find show how to create a custom analyzer on a specific index.
My analyzer, for example:
PUT try_index
{
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
},
"mappings": {
"record": {
"properties": {
"name": {
"type": "string",
"analyzer":"standard",
"search_analyzer": "od_analyzer"
}
}
}
}
}
Any idea how to change my analyzer scope to cluster level?
thanks
There is no "scope" for analyzers. But you can do something similar with index templates:
PUT /_template/some_name_here
{
"template": "a*",
"order": 0,
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
}
}
In "template", put the name pattern of the indices that this template should be applied to when they are created. You could just as well specify "*" to match all indices. I think that's the best you can do for what you want.
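The template carries the synonym rules verbatim, so it may help to see what the two rule styles used in this thread mean. The sketch below is a simplified reading of the rule format, assuming single-word terms and the default expansion of comma-separated groups. `parse_synonym_rules` is a hypothetical helper written for illustration, not how Elasticsearch parses rules internally.

```python
def parse_synonym_rules(rules):
    """Map each input term to the list of terms it is rewritten to."""
    mapping = {}
    for rule in rules:
        if "=>" in rule:
            # Explicit mapping: every left-hand term becomes the right-hand terms.
            left, right = rule.split("=>")
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent synonyms: every term expands to the whole group.
            sources = targets = [t.strip() for t in rule.split(",")]
        for term in sources:
            mapping.setdefault(term, []).extend(targets)
    return mapping


rules = ["dog, cat => animal", "british,uk,england,britain"]
mapping = parse_synonym_rules(rules)
print(mapping["dog"])
print(mapping["uk"])
```

With the "=>" form, a search for dog or cat is rewritten to animal only, whereas a plain comma-separated group treats all of its terms as interchangeable.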
I have a field named path in my Elasticsearch documents, with entries like this:
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_011007/stderr
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_008874/stderr
Note: I want to select all the documents that have the following line in the path field:
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr
I want to run a LIKE-style query on this path field, given the following (basically an AND condition on all three):
I am given an application number: 1451299305289_0120
I am also given a task number: 009257
The path field should also contain stderr
Given the above criteria, the document whose path field is the third line should be selected.
This is what I have tried so far:
http://localhost:9200/logstash-*/_search?q=application_1451299305289_0120 AND path:stderr&size=50
This query fulfills the third criterion and partially the first: if I search for 1451299305289_0120 instead of application_1451299305289_0120, I get 0 results. (What I really need is a LIKE-style search on 1451299305289_0120.)
When I tried this:
http://10.30.145.160:9200/logstash-*/_search?q=path:*_1451299305289_0120*008779 AND path:stderr&size=50
I got the result, but using * at the start is a costly operation. Is there another way to achieve this efficiently (for example with nGram or the fuzzy search in Elasticsearch)?
This can be achieved with the Pattern Replace Char Filter: use a regex to extract only the important bits of information. This is my setup:
POST log_index
{
"settings": {
"analysis": {
"analyzer": {
"app_analyzer": {
"char_filter": [
"app_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"path_analyzer": {
"char_filter": [
"path_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"task_analyzer": {
"char_filter": [
"task_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"app_extractor": {
"type": "pattern_replace",
"pattern": ".*application_(.*)/container.*",
"replacement": "$1"
},
"path_extractor": {
"type": "pattern_replace",
"pattern": ".*/(.*)",
"replacement": "$1"
},
"task_extractor": {
"type": "pattern_replace",
"pattern": ".*container.{27}(.*)/.*",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "keyword",
"fields": {
"application_number": {
"type": "string",
"analyzer": "app_analyzer"
},
"path": {
"type": "string",
"analyzer": "path_analyzer"
},
"task": {
"type": "string",
"analyzer": "task_analyzer"
}
}
}
}
}
}
}
I am extracting the application number, the task number and the last path segment with regexes. You might want to adjust the task regex a bit if you have other log patterns. We can then use filters to search. A big advantage of filters is that they are cached, which makes subsequent calls faster.
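To see what each sub-field ends up containing, the three patterns can be tried outside Elasticsearch. This is a minimal sketch using Python's `re` module as a stand-in for the char filter. The patterns are copied from the setup above, while `EXTRACTORS` and `extract_fields` are names invented for this example, and the lowercase and asciifolding filters are left out.

```python
import re

# Patterns copied from the char_filter definitions above. The "$1" group
# reference used by Elasticsearch is written r"\1" in Python.
EXTRACTORS = {
    "application_number": r".*application_(.*)/container.*",
    "path": r".*/(.*)",
    "task": r".*container.{27}(.*)/.*",
}


def extract_fields(log_path):
    """Apply each pattern as a replace-with-group-1, one result per sub-field."""
    return {name: re.sub(pattern, r"\1", log_path)
            for name, pattern in EXTRACTORS.items()}


sample = ("/logs/hadoop-yarn/container/application_1451299305289_0120/"
          "container_e18_1451299305289_0120_01_009257/stderr")
print(extract_fields(sample))
```

The task pattern skips exactly 27 characters after the last "container", which is why it needs adjusting when the container id has a different layout.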
I indexed a sample log like this:
PUT log_index/your_type/1
{
"name" : "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}
This query will give you the desired results:
GET log_index/_search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"name.application_number": "1451299305289_0120"
}
},
{
"term": {
"name.task": "009257"
}
},
{
"term": {
"name.path": "stderr"
}
}
]
}
}
}
}
}
On a side note, the filtered query is deprecated in ES 2.x; use the filter clause of a bool query directly instead. The path hierarchy tokenizer might also be useful for some other uses.
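For ES 2.x and later, the same search can be expressed without the filtered wrapper. Here is a sketch of the request body built as a Python dict, assuming the field names from the mapping above. `build_log_query` is a hypothetical helper, and the body would still have to be sent with whatever client you use.

```python
import json


def build_log_query(application_number, task, file_name):
    """Build a bool query whose filter clauses must all match (AND)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"name.application_number": application_number}},
                    {"term": {"name.task": task}},
                    {"term": {"name.path": file_name}},
                ]
            }
        }
    }


body = build_log_query("1451299305289_0120", "009257", "stderr")
print(json.dumps(body, indent=2))
```

Clauses under filter are not scored, which matches the original intent of selecting documents rather than ranking them.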
Hope this helps :)