Finding Synonym library - elasticsearch

I have been using synonym in Elasticsearch to map data. I have created a index setting like this
PUT uk-2016.06.22
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"british,uk,england,britain"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Rather than created manually a library txt document in mapping synonyms, is there any synonym library available for downloading to map data for elasticsearch application? Because it seems difficult for me to find one.

Related

elasticsearch extract keywords from text

I have more than 4000 keywords I want to indexed by elasticsearch.
I want to pass it the text and extract the existing keywords.
The first problem is that when I pass a few numbers it works but when I pass a lot of keywords it extracts words that are not in the text.
The second problem is that it only extracts the words before and after space.
I want to extract keyword it from inside the word
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
},
"my_analyzer_shingle": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase",
"shingle"
]
}
}
}
}
}
POST /test/your_type/
{
"keyword": "search"
}
POST /test/your_type/_search
{
"query": {
"match": {
"keyword": "elasticsearch"
}
}
}

Elasticsearch Single analyzer across multiple index

I have time-based indices
students-2018
students-2019
students-2020
I have defined 1 analyzer with synonyms, I want to reuse the same analyzer across multiple indexes, how do I achieve that?
You can define an index template and then create your custom analyzer with that template which includes all your student indices.
You can add your index-pattern in below index template call as mention in the official doc.
Sample index template def
{
"index_patterns": ["student-*"],
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
Now all your student indices like students-2018 , students-2019 will have this my_custom_analyzer which is defined in the index template.
Create a student index without any setting and analyzer like
http://{{you-es-hostname}}/student-2018
And then check its setting using GET http://{{you-es-hostname}}/student-2018, which would give below output and includes the analyzer created in the index template.
{
"student-2018": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "student-2018",
"creation_date": "1588653678067",
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": [
"html_strip"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "kjGEgKCOSJeIlrASP-RaMQ",
"version": {
"created": "7040299"
}
}
}
}
}

Custom predefined stopword list in Elasticsearch

How can i define a custom stopword list globally in a way it is accessible from all indexes.
it would be ideal to use this stopword list just like the way we use predefined language-specific stopword lists:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": "_my_predefined_stopword_list_"
}
}
}
}
}
The official elastcisearch documentation describes how to create a custom filter with a list of stopwords. You can find the description here:
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": [ "si", "esta", "el", "la" ]
},
"light_spanish": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"my_spanish": {
"tokenizer": "spanish",
"filter": [
"lowercase",
"asciifolding",
"spanish_stop",
"light_spanish"
]
}
}
}
}
}
After defining this filter spanish_stop you can use it in the definition of your indices.

Elastic search cluster level analyzer

How can I define one custom analyzer that will be used in more than one index (in a cluster level)? All the examples I can find shows how to create a custom analyzer on a specific index.
My analyzer for example:
PUT try_index
{
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
},
"mappings": {
"record": {
"properties": {
"name": {
"type": "string",
"analyzer":"standard",
"search_analyzer": "od_analyzer"
}
}
}
}
}
Any idea how to change my analyzer scope to cluster level?
thanks
There is no "scope" for analyzers. But you can do something similar with index templates:
PUT /_template/some_name_here
{
"template": "a*",
"order": 0,
"settings": {
"analysis": {
"filter": {
"od_synonyms": {
"type": "synonym",
"synonyms": [
"dog, cat => animal",
"john, lucas => boy",
"emma, kate => girl"
]
}
},
"analyzer": {
"od_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"od_synonyms"
]
}
}
}
}
}
And at "template" you should put the name of the indices that this template should be applied to when the index is created. You could very well specify "*" and matching all the indices. I think that's the best you can do for what you want.

multiple like query in elastic search

I have a field path in my elastic-search documents which has entries like this
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_011007/stderr
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_008874/stderr
#*Note -- I want to select all the documents having below line in the **path** field
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr
I want to make a like query on this path field given certain things(basically an AND condition on all the 3):-
I have given application number 1451299305289_0120
I have also given a task number 009257
The path field should also contain stderr
Given the above criteria the document having the path field as the 3rd line should be selected
This is what I have tries so far
http://localhost:9200/logstash-*/_search?q=application_1451299305289_0120 AND path:stderr&size=50
This query fulfills the 3rd criteria, and partially the 1st criteria i.e if I search for 1451299305289_0120 instead of application_1451299305289_0120, I got 0 results. (What I really need is like search on 1451299305289_0120)
When I tried this
http://10.30.145.160:9200/logstash-*/_search?q=path:*_1451299305289_0120*008779 AND path:stderr&size=50
I got the result, but using * at the start is a costly operation. Is their another way to achieve this effectively (like using nGram and using fuzzy-search of elastic-search)
This can be achieved by using Pattern Replace Char Filter. You just extract only important bits of information with regex. This is my setup
POST log_index
{
"settings": {
"analysis": {
"analyzer": {
"app_analyzer": {
"char_filter": [
"app_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"path_analyzer": {
"char_filter": [
"path_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
},
"task_analyzer": {
"char_filter": [
"task_extractor"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"app_extractor": {
"type": "pattern_replace",
"pattern": ".*application_(.*)/container.*",
"replacement": "$1"
},
"path_extractor": {
"type": "pattern_replace",
"pattern": ".*/(.*)",
"replacement": "$1"
},
"task_extractor": {
"type": "pattern_replace",
"pattern": ".*container.{27}(.*)/.*",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "keyword",
"fields": {
"application_number": {
"type": "string",
"analyzer": "app_analyzer"
},
"path": {
"type": "string",
"analyzer": "path_analyzer"
},
"task": {
"type": "string",
"analyzer": "task_analyzer"
}
}
}
}
}
}
}
I am extracting application number, task number and path with regex. You might want to optimize task regex a bit if you have some other log pattern, then we can use Filters to search.A big advantage of using filters is that they are cached and make subsequent calls faster.
I indexed sample log like this
PUT log_index/your_type/1
{
"name" : "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}
This query will give you desired results
GET log_index/_search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"name.application_number": "1451299305289_0120"
}
},
{
"term": {
"name.task": "009257"
}
},
{
"term": {
"name.path": "stderr"
}
}
]
}
}
}
}
}
On a side note filtered query is deprecated in ES 2.x, just use filter directly.Also path hierarchy might be useful for some other uses
Hope this helps :)

Resources