Elasticsearch snowball in French not stemming correctly - elasticsearch

I've seen a problem with the same stem word in French.
Here is an example: snowball in French
or
curl -XDELETE http://localhost:9200/stacko36088193
curl -XPOST http://localhost:9200/stacko36088193 -d '
{
"index": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "snowball",
"language" : "French"
}
}
}
}
}'
curl 'localhost:9200/stacko36088193/_analyze?pretty=1&analyzer=my_analyzer' -d 'développeur développeuse'
And see token keys
{
"tokens" : [ {
"token" : "développeur",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "développ",
"start_offset" : 12,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
How can you do to have the same stem for all of these words?

Related

removing special characters and words from a url elasticsearch

I was looking for a way to generate words and special characters as tokens from a url.
eg. I have a url https://www.google.com/
I want to generate tokens in elastic as https, www,google, com, :, /, /, ., ., /
You can define custom analyzer with letter tokenizer as shown below:
PUT index3
{
"settings": {
"analysis": {
"analyzer": {
"my_email": {
"tokenizer": "letter",
"filter": [
"lowercase"
]
}
}
}
}
}
Test API:
POST index3/_analyze
{
"text": [
"https://www.google.com/"
],
"analyzer": "my_email"
}
Output:
{
"tokens" : [
{
"token" : "https",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "www",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "google",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "com",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 3
}
]
}

Elasticsearch synonym_graph filter not giving all tokens

I'm trying to use synonym_graph filter in the analyzer but it is not generating the result we need.
This is my curl command to analyze text:
curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": [
{
"type": "synonym_graph",
"lenient": true,
"synonyms": [ "one market,responsis", "one,1"]
}
],
"text": "responsis"
}'
I got analyzed tokens: one, responsis and market
Given response:
{
"tokens" : [
{
"token" : "one",
"start_offset" : 0,
"end_offset" : 9,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "responsis",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0,
"positionLength" : 2
},
{
"token" : "market",
"start_offset" : 0,
"end_offset" : 9,
"type" : "SYNONYM",
"position" : 1
}
]
}
But i want the analyzed tokens to be: Responis, One, Market and 1
Some how it is not giving all tokens to generate the result.
[Note: I don't want to add synonyms in one group.]
Thank you for your answer in advance.

How to map one word to another word in elasticsearch?

how can I map a word to another word in Elasticsearch?. That is suppose I have the following data document
{
"carName" : "Porche"
"review": " this car is so awesome"
}
Now when I search good/fantastic etc, it should map to "awesome".
Is there any way I can do this in elasticsearch?
Yes, you can achieve this by using a synonym token filter.
First you need to define a new custom analyzer in your index and use that analyzer in your mapping.
curl -XPUT localhost:9200/cars -d '{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"synonyms"
]
}
},
"filter": {
"synonyms": {
"type": "synonym",
"synonyms": [
"good, awesome, fantastic"
]
}
}
}
},
"mappings": {
"car": {
"properties": {
"carName": {
"type": "string"
},
"review": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}'
You can add as many synonyms as you want, either in the settings directly or in a separate file that you can reference in the settings using the synonyms_path property.
Then we can index your sample document above:
curl -XPUT localhost:9200/cars/car/1 -d '{
"carName": "Porche",
"review": " this car is so awesome"
}'
What is going to happen is that when the synonyms token filter kicks in, it will also index the tokens good and fantastic along with awesome so that you can search and find that document by those tokens as well. Concretely, analyzing the sentence this car is so awesome...
curl -XGET 'localhost:9200/cars/_analyze?analyzer=my_analyzer&pretty' -d 'this car is so awesome'
...will produce the following tokens (see the last three tokens)
{
"tokens" : [ {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "car",
"start_offset" : 5,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "is",
"start_offset" : 9,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "so",
"start_offset" : 12,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "good",
"start_offset" : 15,
"end_offset" : 22,
"type" : "SYNONYM",
"position" : 5
}, {
"token" : "awesome",
"start_offset" : 15,
"end_offset" : 22,
"type" : "SYNONYM",
"position" : 5
}, {
"token" : "fantastic",
"start_offset" : 15,
"end_offset" : 22,
"type" : "SYNONYM",
"position" : 5
} ]
}
Finally, you can search like this and the document will be retrieved:
curl -XGET localhost:9200/cars/car/_search?q=review:good

Combine search terms automatically when with Elasticssearch?

Using elasticsearch for searching our documents we discovered that when we search for "wave board" we get no good results, because documents containing "waveboard" are not at the top of the results. Google does this kind of "term combining". Is there a simple way to do this in ES?
Found a good solution: Create a custom anaylzer with a shingle filter using "" as a token separator and use that in a query (use bool query to combine with standard queries)
To do this at analysis time, you can also use what is know as a "decompounding"
token filter. Here is an example to decompound the text "catdogmouse" into the
tokens "cat", "dog", and "mouse":
POST /decom
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"decom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["decom_filter"]
}
},
"filter": {
"decom_filter": {
"type": "dictionary_decompounder",
"word_list": ["cat", "dog", "mouse"]
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"body": {
"type": "string",
"analyzer": "decom_analyzer"
}
}
}
}
}
And then you can see how they are applied to certain terms:
POST /decom/_analyze?field=body&pretty
racecatthings
{
"tokens" : [ {
"token" : "racecatthings",
"start_offset" : 1,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "cat",
"start_offset" : 1,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
And another: (you should be able to extrapolate this to separate "waveboard"
into "wave" and "board")
POST /decom/_analyze?field=body&pretty
catdogmouse
{
"tokens" : [ {
"token" : "catdogmouse",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "cat",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "dog",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "mouse",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}

Elasticsearch, how to concatenate words then ngram it?

I'd like to concatenate words then ngram it.
What's the correct setting for elasticsearch?
In english,
from: stack overflow
==> stackoverflow : concatenate first,
==> sta / tac / ack / cko / kov / ... and etc (min_gram: 3, max_gram: 10)
To do the concatenation I'm assuming that you just want to remove all spaces from your input data. To do this, you need to implement a pattern_replace char filter that replaces space with nothing.
Setting up the ngram tokenizer should be easy - just specify your token min/max lengths.
It's worth adding a lowercase token filter too - to make searching case insensitive.
curl -XPOST localhost:9200/my_index -d '{
"index": {
"analysis": {
"analyzer": {
"my_new_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "my_ngram_tokenizer",
"char_filter" : ["my_pattern"],
"type": "custom"
}
},
"char_filter" : {
"my_pattern":{
"type":"pattern_replace",
"pattern":"\u0020",
"replacement":""
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "3",
"max_gram" : "10",
"token_chars": ["letter", "digit", "punctuation", "symbol"]
}
}
}
}
}'
testing this:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'
gives the following (just a small part shown below):
{
"tokens" : [ {
"token" : "sta",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "stac",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "stack",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 3
}, {
"token" : "stacko",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 4
}, {
"token" : "stackov",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 5
}, {

Resources