Elasticsearch - Custom stem override with wildcard character

I have implemented light English stemming in Elasticsearch.
I'm able to add a custom stem override so that "Guitarist" => "Guitar", for example, but I would like to add this as a general rule, so that "Guitarist" => "Guitar", "Violinist" => "Violin", etc.
Can I achieve this without using regex?

For anyone looking at a similar problem, it appears that regex is the only solution. The example below specifically handles words ending in "ist".
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "tokenizer": "standard",
        "char_filter": [
          "ist_filter"
        ],
        "filter": [
          "lowercase",
          "my_stem"
        ]
      }
    },
    "filter": {
      "my_stem": {
        "type": "stemmer",
        "language": "light_english"
      }
    },
    "char_filter": {
      "ist_filter": {
        "type": "pattern_replace",
        "pattern": "(.*)ist$",
        "replacement": "$1"
      }
    }
  }
}
Exclusions can be added to the pattern, e.g. the below would ignore the words "mist" and "twist", but this would only be practical for a (very) limited number of exclusions.
"pattern": "^(?!m|tw)(.*)ist$"

Related

Elasticsearch: Problem with Italian analyzer

I noticed that the ES Italian analyzer does not stem words shorter than 6 characters, and this obviously creates a problem for my work. I tried to solve it by customizing the analyzer but unfortunately did not succeed. I then implemented a hunspell analyzer in the index, but it isn't very scalable, so I want to keep the analyzer algorithmic. Does someone have a suggestion on how to solve this problem?
The default Italian language stemmer in Elasticsearch is not the normal snowball stemmer, but a light version called light_italian. I was able to reproduce that it doesn't stem some tokens that are shorter than 6 characters, as you described:
POST /_analyze
{
  "analyzer": "italian",
  "text": "pronto propio logie logia morte"
}
But Elasticsearch includes another Italian stemmer token filter, called italian, that does stem these tokens. You can test it with this code:
PUT /my-italian-stemmer-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      }
    }
  }
}

POST /my-italian-stemmer-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "pronto propio logie logia morte"
}
If you want to use it, you should rebuild the original Italian analyzer and swap out the token filter:
PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
            "c", "l", "all", "dall", "dell",
            "nell", "sull", "coll", "pell",
            "gl", "agl", "dagl", "degl", "negl",
            "sugl", "un", "m", "t", "s", "v", "d"
          ],
          "articles_case": true
        },
        "italian_stop": {
          "type": "stop",
          "stopwords": "_italian_"
        },
        "italian_keywords": {
          "type": "keyword_marker",
          "keywords": ["esempio"]
        },
        "italian_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer": "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}
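Afterwards you can confirm that the rebuilt analyzer stems the short tokens by re-running the same test text against it:
POST /italian_example/_analyze
{
  "analyzer": "rebuilt_italian",
  "text": "pronto propio logie logia morte"
}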

How to search by words written together among data where these words are written apart in Elasticsearch?

I have documents which have, let's say, one field: the name of the document. A name may consist of several words written apart, for example:
{
  "name": "first document"
},
{
  "name": "second document"
}
My goal is to be able to search for these documents by strings:
firstdocument, seconddocumen
As you can see, the search strings are misspelled, but they would still match those documents if we deleted the whitespace from the documents' names. This could be handled by creating another field with the same string but without whitespace, but that seems like redundant data unless there is no other way to do it.
I need something similar to this:
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "shingle",
      "max_shingle_size": 3,
      "min_shingle_size": 2,
      "output_unigrams": "true",
      "token_separator": ""
    }
  ],
  "text": "first document"
}
But the other way around. I need to apply this kind of transformation not to the search text, but to the indexed documents (their name fields), so that I can find documents even when the search text contains a small misspelling. How should this be done?
I suggest using multi-fields with an analyzer that removes the whitespace.
Analyzer
"no_spaces": {
"filter": [
"lowercase"
],
"char_filter": [
"remove_spaces"
],
"tokenizer": "standard"
}
Char Filter
"remove_spaces": {
"type": "pattern_replace",
"pattern": "[ ]",
"replacement": ""
}
Field Mapping
"name": {
"type": "text",
"fields": {
"without_spaces": {
"type": "text",
"analyzer": "no_spaces"
}
}
}
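Putting it together, the full index definition might look like this (a sketch, assuming Elasticsearch 7+ with typeless mappings and a hypothetical index name my_index):
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_spaces": {
          "type": "pattern_replace",
          "pattern": "[ ]",
          "replacement": ""
        }
      },
      "analyzer": {
        "no_spaces": {
          "tokenizer": "standard",
          "char_filter": [
            "remove_spaces"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "without_spaces": {
            "type": "text",
            "analyzer": "no_spaces"
          }
        }
      }
    }
  }
}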
Query
GET /_search
{
  "query": {
    "match": {
      "name.without_spaces": {
        "query": "seconddocumen",
        "fuzziness": "AUTO"
      }
    }
  }
}
EDIT:
For completeness: an alternative to the remove_spaces char filter could be the shingle filter:
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"output_unigrams": "false",
"token_separator": ""
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"shingle_filter"
]
}
}
}
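You can test this variant inline, the same way as in the question; with output_unigrams disabled and the default shingle size of 2, "first document" should come out as the single combined token "firstdocument":
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "shingle",
      "output_unigrams": "false",
      "token_separator": ""
    }
  ],
  "text": "first document"
}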

Multiple like query in Elasticsearch

I have a field path in my elastic-search documents which has entries like this
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_011007/stderr
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_008874/stderr
Note: I want to select all the documents having the below line in the path field
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr
I want to make a like query on this path field, given certain criteria (basically an AND condition on all three):
I have given the application number 1451299305289_0120
I have also given a task number 009257
The path field should also contain stderr
Given the above criteria, the document whose path field is the 3rd line should be selected.
This is what I have tried so far:
http://localhost:9200/logstash-*/_search?q=application_1451299305289_0120 AND path:stderr&size=50
This query fulfills the 3rd criterion, and partially the 1st, i.e. if I search for 1451299305289_0120 instead of application_1451299305289_0120, I get 0 results. (What I really need is a like search on 1451299305289_0120.)
When I tried this
http://10.30.145.160:9200/logstash-*/_search?q=path:*_1451299305289_0120*008779 AND path:stderr&size=50
I got the result, but using * at the start is a costly operation. Is there another way to achieve this efficiently (like using nGram and the fuzzy search of Elasticsearch)?
This can be achieved by using the Pattern Replace Char Filter. You just extract the important bits of information with regex. This is my setup:
POST log_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "app_analyzer": {
          "char_filter": [
            "app_extractor"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "path_analyzer": {
          "char_filter": [
            "path_extractor"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "task_analyzer": {
          "char_filter": [
            "task_extractor"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "app_extractor": {
          "type": "pattern_replace",
          "pattern": ".*application_(.*)/container.*",
          "replacement": "$1"
        },
        "path_extractor": {
          "type": "pattern_replace",
          "pattern": ".*/(.*)",
          "replacement": "$1"
        },
        "task_extractor": {
          "type": "pattern_replace",
          "pattern": ".*container.{27}(.*)/.*",
          "replacement": "$1"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "keyword",
          "fields": {
            "application_number": {
              "type": "string",
              "analyzer": "app_analyzer"
            },
            "path": {
              "type": "string",
              "analyzer": "path_analyzer"
            },
            "task": {
              "type": "string",
              "analyzer": "task_analyzer"
            }
          }
        }
      }
    }
  }
}
I am extracting the application number, task number and path with regex. You might want to tweak the task regex a bit if you have some other log pattern. We can then use filters to search. A big advantage of using filters is that they are cached, which makes subsequent calls faster.
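To see what each sub-field will actually index, you can run the analyzers directly against the sample path (a quick check, assuming the index above):
POST log_index/_analyze
{
  "analyzer": "app_analyzer",
  "text": "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}
This should return the single token 1451299305289_0120; path_analyzer and task_analyzer can be checked the same way.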
I indexed a sample log like this:
PUT log_index/your_type/1
{
  "name": "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}
This query will give you the desired results:
GET log_index/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "name.application_number": "1451299305289_0120"
              }
            },
            {
              "term": {
                "name.task": "009257"
              }
            },
            {
              "term": {
                "name.path": "stderr"
              }
            }
          ]
        }
      }
    }
  }
}
On a side note, the filtered query is deprecated in ES 2.x, so just use the filter directly. Also, the path_hierarchy tokenizer might be useful for some other use cases.
Hope this helps :)

Elasticsearch replace whitespace

I'm trying to find a tokenizer in Elasticsearch that would remove all the whitespace and convert multiple words into a single word.
For example: Abd al Qadir ===> Abdalqadir
A way to achieve that would be to create a custom filter using the pattern_replace filter, and a custom analyzer combining that filter with the lowercase one.
Here's an example of what the configuration would look like:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
}
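You can verify the result with the _analyze API (assuming these settings were applied to a hypothetical index named my_index):
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Abd al Qadir"
}
This should produce the single token abdalqadir.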

Elasticsearch "pattern_replace", replacing whitespaces while analyzing

Basically I want to remove all whitespace and tokenize the whole string as a single token. (I will use nGram on top of that later on.)
These are my index settings:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"meliuz_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
Instead of "pattern": " ", I tried "pattern": "\\u0020" and \\s , too.
But when I analyze the text "beleza na web", it still creates three separate tokens: "beleza", "na" and "web", instead of one single "belezanaweb".
An analyzer processes a string by first tokenizing it and then applying a series of token filters. You have specified the standard tokenizer, which means the input is already split into separate tokens, and the pattern_replace filter is then applied to each token individually, where no token contains a space anymore.
Use the keyword tokenizer instead of the standard tokenizer. The rest of the mapping is fine.
You can change your mapping as below:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"meliuz_analyzer": {
"filter": [
"lowercase",
"whitespace_remove",
"nGram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
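With the keyword tokenizer the whole input becomes one token before the filters run, so "beleza na web" is lowercased, stripped of spaces, and only then split into n-grams. A quick check (assuming a hypothetical index named my_index with these settings):
POST /my_index/_analyze
{
  "analyzer": "meliuz_analyzer",
  "text": "beleza na web"
}
The response should contain n-grams of belezanaweb (sizes 1 and 2 with the built-in nGram filter's defaults) rather than the three separate tokens.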
