Elasticsearch: Problem with Italian analyzer

I noticed that the ES Italian analyzer does not stem words shorter than 6 characters, which obviously creates a problem for my work. I tried to solve it by customizing the analyzer but unfortunately did not succeed. So I implemented a hunspell analyzer in the index, but it isn't very scalable, so I want to keep the analyzer algorithmic. Does someone have a suggestion on how to solve this problem?

The default Italian language stemmer in Elasticsearch is not the regular Snowball stemmer but a light version called light_italian. I was able to reproduce that it doesn't stem some tokens shorter than 6 characters, as you described:
POST /_analyze
{
  "analyzer": "italian",
  "text": "pronto propio logie logia morte"
}
But Elasticsearch also provides a stemmer token filter with "language": "italian" that does stem these tokens. You can test it with this code:
PUT /my-italian-stemmer-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer"
          ]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      }
    }
  }
}

POST /my-italian-stemmer-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "pronto propio logie logia morte"
}
If you want to use it, you can rebuild the original Italian analyzer and swap in that stemmer in place of the default light_italian one:
PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
            "c", "l", "all", "dall", "dell",
            "nell", "sull", "coll", "pell",
            "gl", "agl", "dagl", "degl", "negl",
            "sugl", "un", "m", "t", "s", "v", "d"
          ],
          "articles_case": true
        },
        "italian_stop": {
          "type": "stop",
          "stopwords": "_italian_"
        },
        "italian_keywords": {
          "type": "keyword_marker",
          "keywords": ["esempio"]
        },
        "italian_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer": "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}
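To verify the swap, you can run the same sample text through the rebuilt analyzer (assuming the index above was created as-is):
POST /italian_example/_analyze
{
  "analyzer": "rebuilt_italian",
  "text": "pronto propio logie logia morte"
}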

Related

Elasticsearch German stemmer doesn't handle plurals

I'm working on a basic German analyzer in Elasticsearch, which is defined as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stemmer": {
          "type": "snowball",
          "language": "German"
        },
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        }
      },
      "analyzer": {
        "german_search": {
          "filter": ["lowercase", "german_stop", "german_stemmer"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
While testing it I realized that it is not dealing well with Kürbis and Kürbisse. Stemming those two words produces different output, while from my understanding (just what I read online) Kürbis means pumpkin and Kürbisse is its plural. It looks like the stemmer is not dealing well with plurals.
Any ideas on how I can solve this?
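The thread doesn't contain an answer, but one low-effort way to investigate is to create a throwaway index with a couple of candidate German stemmers and compare their output on both forms with _analyze. The index name and analyzer names below are placeholders, and light_german / German2 are simply other built-in variants worth trying, not a guaranteed fix:
PUT /stemmer_test
{
  "settings": {
    "analysis": {
      "filter": {
        "light_german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        },
        "german2_stemmer": {
          "type": "snowball",
          "language": "German2"
        }
      },
      "analyzer": {
        "light_german_test": {
          "tokenizer": "standard",
          "filter": ["lowercase", "light_german_stemmer"]
        },
        "german2_test": {
          "tokenizer": "standard",
          "filter": ["lowercase", "german2_stemmer"]
        }
      }
    }
  }
}

POST /stemmer_test/_analyze
{
  "analyzer": "light_german_test",
  "text": "Kürbis Kürbisse"
}
Run the same _analyze request against each test analyzer and check whether Kürbis and Kürbisse collapse to the same term.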

custom tokenizer without using built-in token filters

How do I create a custom tokenizer without using the default built-in token filters? E.g., for the text "Samsung Galaxy S9",
I want to tokenize this text such that it is indexed like this:
["samsung", "galaxy", "s9", "samsung galaxy s9", "samsung s9", "samsung galaxy", "galaxy s9"].
How would I do that?
PUT testindex
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 20,
          "min_shingle_size": 2,
          "output_unigrams": "true"
        }
      },
      "analyzer": {
        "analyzer_shingle": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "analyzer": "analyzer_shingle",
          "search_analyzer": "standard",
          "type": "text"
        }
      }
    }
  }
}

POST testindex/product/1
{
  "title": "Samsung Galaxy S9"
}

GET testindex/_analyze
{
  "analyzer": "analyzer_shingle",
  "text": ["Samsung Galaxy S9"]
}
You can find more about shingles here and here.
The first example is great and it covers a lot. If you want to use the standard tokenizer rather than the whitespace one, you'll also have to take care of the stop words, as the blog post describes. Both of the URLs are official ES sources.
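As a rough sketch of that point (my own illustration, not taken from the linked posts), a standard-tokenizer variant would put a stop filter in front of the shingle filter and set the shingle filter's filler_token so that removed stop words don't leave "_" placeholders in the shingles. The index name here is a placeholder:
PUT testindex_standard
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 20,
          "min_shingle_size": 2,
          "output_unigrams": "true",
          "filler_token": ""
        }
      },
      "analyzer": {
        "analyzer_shingle": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "filter_stop",
            "filter_shingle"
          ]
        }
      }
    }
  }
}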

Elasticsearch - how to use a language analyzer with a UTF-8 filter?

I have a problem with the Elasticsearch language analyzer. I am working with the Lithuanian language, so I am using the Lithuanian language analyzer. The analyzer works fine and I get all the word cases I need. For example, I index the Lithuanian city "Klaipėda":
PUT /cities/city/1
{
  "name": "Klaipėda"
}
The problem is that I also need to get a result when searching for "Klaipėda" written only in the Latin alphabet ("Klaipeda"), in all Lithuanian cases:
Nominative case: "Klaipeda"
Genitive case: "Klaipedos"
...
Locative case: "Klaipedoje"
"Klaipėda", "Klaipėdos", "Klaipėdoje" work, but "Klaipeda", "Klaipedos", "Klaipedoje" do not.
My index:
PUT /cities
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lithuanian",
          "fields": {
            "folded": {
              "type": "string",
              "analyzer": "md_folded_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "md_folded_analyzer": {
          "type": "lithuanian",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "lithuanian_stop",
            "lithuanian_keywords",
            "lithuanian_stemmer"
          ]
        }
      }
    }
  }
}
and search query:
GET /cities/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "klaipeda",
      "fields": [ "name", "name.folded" ]
    }
  }
}
What am I doing wrong? Thanks for the help.
The technique you are using here is so-called multi-fields. The limitation of the underlying name.folded field is that you can't perform a search against it - you can only sort and aggregate on name.folded.
To work around this I've come up with the following set-up:
Separate fields set-up (to eliminate duplication, just specify copy_to):
curl -XPUT http://localhost:9200/cities -d '
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lithuanian",
          "copy_to": "folded"
        },
        "folded": {
          "type": "string",
          "analyzer": "md_folded_analyzer"
        }
      }
    }
  }
}'
Change the type of your analyzer to custom as described here, because otherwise the asciifolding filter does not make it into the config. More importantly, asciifolding should go after all the Lithuanian stemming / stop-word filters, because after folding a word can lose its intended sense.
curl -XPUT http://localhost:9200/my_cities -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "lithuanian_stop": {
          "type": "stop",
          "stopwords": "_lithuanian_"
        },
        "lithuanian_stemmer": {
          "type": "stemmer",
          "language": "lithuanian"
        }
      },
      "analyzer": {
        "md_folded_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "lithuanian_stop",
            "lithuanian_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  }
}'
Sorry, I've left out lithuanian_keywords - it requires additional set-up, which I skipped here. But I hope you've got the idea.
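For completeness (my own sketch, not part of the original answer), the omitted lithuanian_keywords filter would be a keyword_marker that protects listed terms from stemming; it has to sit before lithuanian_stemmer in the chain, and the keyword list and index name below are only placeholders:
curl -XPUT http://localhost:9200/my_cities_keywords -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "lithuanian_stop": {
          "type": "stop",
          "stopwords": "_lithuanian_"
        },
        "lithuanian_keywords": {
          "type": "keyword_marker",
          "keywords": ["pavyzdys"]
        },
        "lithuanian_stemmer": {
          "type": "stemmer",
          "language": "lithuanian"
        }
      },
      "analyzer": {
        "md_folded_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "lithuanian_stop",
            "lithuanian_keywords",
            "lithuanian_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  }
}'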

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an Elasticsearch index using a custom analyzer which uses the letter tokenizer and the lowercase and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't come back with any result. When I tried the full word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz, tried to search by sub-words again, and it worked.
To try to understand what is going on, I checked the terms generated for my documents using the _termvector service, and the result was identical for both the underscore-separated and the dash-separated sub-words, so I really expect the search results to be identical in both cases.
Any idea what I could be doing wrong?
If it helps, these are the settings I used for my index:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "cmt_value_analyzer": {
            "tokenizer": "letter",
            "filter": [
              "lowercase",
              "my_filter"
            ],
            "type": "custom"
          }
        },
        "filter": {
          "my_filter": {
            "type": "word_delimiter"
          }
        }
      }
    }
  },
  "mappings": {
    "alertmodel": {
      "properties": {
        "name": {
          "analyzer": "cmt_value_analyzer",
          "term_vector": "with_positions_offsets_payloads",
          "type": "string"
        },
        "productId": {
          "type": "double"
        },
        "productName": {
          "analyzer": "cmt_value_analyzer",
          "term_vector": "with_positions_offsets_payloads",
          "type": "string"
        },
        "link": {
          "analyzer": "cmt_value_analyzer",
          "term_vector": "with_positions_offsets_payloads",
          "type": "string"
        },
        "updatedOn": {
          "type": "date"
        }
      }
    }
  }
}
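The thread doesn't include an answer, but a quick way to see what is actually being indexed is to run both variants through the field's analyzer with _analyze and compare the emitted tokens against what your query-time analyzer produces. The index name below is a placeholder for whichever index was created with these settings, and depending on your Elasticsearch version the body-style request shown here may need to be written as the older query-string form of _analyze:
GET /myindex/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc_xyz"
}

GET /myindex/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc-xyz"
}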

Using Shingles and Stop words with Elasticsearch and Lucene 4.4

In the index I'm building, I'm interested in running a query, then (using facets) returning the shingles of that query. Here's the analyzer I'm using on the text:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "3"
        }
      }
    }
  }
}
The major issue is that, with Lucene 4.4, stop filters no longer support the enable_position_increments parameter to eliminate shingles that contain stop words. Instead, for the text "red and yellow" I'd get results like:
"terms": [
{
"term": "red",
"count": 43
},
{
"term": "red _",
"count": 43
},
{
"term": "red _ yellow",
"count": 43
},
{
"term": "_ yellow",
"count": 42
},
{
"term": "yellow",
"count": 42
}
]
Naturally this GREATLY skews the number of shingles returned. Is there a way post-Lucene 4.4 to manage this without doing post-processing on the results?
Probably not the optimal solution, but the bluntest approach would be to add another filter to your analyzer that kills the "_" filler tokens. In the example below I called it "kill_fillers":
"shingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle",
"custom_stemmer",
"kill_fillers"
],
...
Add "kill_fillers" filter to your list of filters:
"filters":{
...
"kill_fillers": {
"type": "pattern_replace",
"pattern": ".*_.*",
"replace": "",
},
...
}
I'm not sure if this helps, but in the Elasticsearch definition of shingles you can use the parameter filler_token, which defaults to "_". Set it to, for example, an empty string:
$indexParams['body']['settings']['analysis']['filter']['shingle-filter']['filler_token'] = "";
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-shingle-tokenfilter.html
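If you define the filter in the index settings rather than through the PHP client, the equivalent would look roughly like this (the shingle-filter name mirrors the PHP snippet above; the shingle sizes are illustrative):
"filter": {
  "shingle-filter": {
    "type": "shingle",
    "min_shingle_size": 2,
    "max_shingle_size": 3,
    "output_unigrams": true,
    "filler_token": ""
  }
}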
