How to map one word to another word in elasticsearch? - elasticsearch

How can I map a word to another word in Elasticsearch? Suppose I have the following document:
{
  "carName": "Porche",
  "review": " this car is so awesome"
}
Now when I search for good, fantastic, etc., it should match "awesome".
Is there any way I can do this in elasticsearch?

Yes, you can achieve this by using a synonym token filter.
First, you need to define a new custom analyzer in your index settings and use that analyzer in your mapping:
curl -XPUT localhost:9200/cars -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "synonyms"
          ]
        }
      },
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "good, awesome, fantastic"
          ]
        }
      }
    }
  },
  "mappings": {
    "car": {
      "properties": {
        "carName": {
          "type": "string"
        },
        "review": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
You can add as many synonyms as you want, either in the settings directly or in a separate file that you can reference in the settings using the synonyms_path property.
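For example, a minimal sketch of the file-based variant (the file name analysis/synonyms.txt is an assumption for illustration; the path is resolved relative to the Elasticsearch config directory, and the file holds one synonym rule per line, e.g. good, awesome, fantastic):
"filter": {
  "synonyms": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt"
  }
}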
Then we can index your sample document above:
curl -XPUT localhost:9200/cars/car/1 -d '{
  "carName": "Porche",
  "review": " this car is so awesome"
}'
What will happen is that when the synonyms token filter kicks in, it will also index the tokens good and fantastic along with awesome, so that you can search for and find that document by those tokens as well. Concretely, analyzing the sentence this car is so awesome...
curl -XGET 'localhost:9200/cars/_analyze?analyzer=my_analyzer&pretty' -d 'this car is so awesome'
...will produce the following tokens (note the last three):
{
"tokens" : [ {
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "car",
"start_offset" : 5,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
}, {
"token" : "is",
"start_offset" : 9,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
}, {
"token" : "so",
"start_offset" : 12,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 4
}, {
"token" : "good",
"start_offset" : 15,
"end_offset" : 22,
"type" : "SYNONYM",
"position" : 5
}, {
"token" : "awesome",
"start_offset" : 15,
"end_offset" : 22,
"type" : "SYNONYM",
"position" : 5
}, {
"token" : "fantastic",
"start_offset" : 15,
"end_offset" : 22,
"type" : "SYNONYM",
"position" : 5
} ]
}
Finally, you can search like this and the document will be retrieved:
curl -XGET localhost:9200/cars/car/_search?q=review:good
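Equivalently, as a request-body search (a minimal sketch; any of the configured synonyms will match, since the review field is analyzed with my_analyzer at search time as well):
curl -XGET localhost:9200/cars/car/_search -d '{
  "query": {
    "match": {
      "review": "fantastic"
    }
  }
}'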

Related

Elasticsearch synonym_graph filter not giving all tokens

I'm trying to use the synonym_graph filter in the analyzer, but it is not generating the result we need.
This is my curl command to analyze text:
curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"tokenizer": "whitespace",
"filter": [
{
"type": "synonym_graph",
"lenient": true,
"synonyms": [ "one market,responsis", "one,1"]
}
],
"text": "responsis"
}'
I got the analyzed tokens one, responsis, and market.
Here is the response:
{
"tokens" : [
{
"token" : "one",
"start_offset" : 0,
"end_offset" : 9,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "responsis",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0,
"positionLength" : 2
},
{
"token" : "market",
"start_offset" : 0,
"end_offset" : 9,
"type" : "SYNONYM",
"position" : 1
}
]
}
But I want the analyzed tokens to be: responsis, one, market and 1.
Somehow it is not producing all the tokens needed.
[Note: I don't want to put all the synonyms in one group.]
Thanks in advance for your answer.

Return sets of keywords derived from fields in ElasticSearch

I'm kinda new to this and need help; I looked online but couldn't find the answer I'm looking for. Basically, what I'm trying to do is autocomplete based on keywords derived from some text fields.
Here is an example of my indexed documents:
"name": "One liter of Chocolate Milk"
"name": "Milo Milk 250g"
"name": "HiLow low fat milk"
"name": "Yoghurt strawberry"
"name": "Milk Nutrisoy"
So when I type in "mi", I'm expecting to get results like:
"milk"
"milo"
"milo milk"
"chocolate milk"
etc
A very good example is the aliexpress.com autocomplete.
Thanks in advance
That seems like a good use case for the shingle token filter:
curl -XPUT localhost:9200/your_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_shingles": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "shingles"
          ]
        }
      },
      "filter": {
        "shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "field": {
          "type": "string",
          "analyzer": "my_shingles"
        }
      }
    }
  }
}'
If you analyze Milo Milk 250g with this analyzer, you'll get the following tokens:
curl -XGET 'localhost:9200/your_index/_analyze?analyzer=my_shingles&pretty' -d 'Milo Milk 250g'
{
"tokens" : [ {
"token" : "milo",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "milo milk",
"start_offset" : 0,
"end_offset" : 9,
"type" : "shingle",
"position" : 0
}, {
"token" : "milk",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "milk 250g",
"start_offset" : 5,
"end_offset" : 14,
"type" : "shingle",
"position" : 1
}, {
"token" : "250g",
"start_offset" : 10,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
So a prefix search for mi will match the following tokens:
milo
milo milk
milk
milk 250g
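For the autocomplete lookup itself, one option is a prefix query (a minimal sketch reusing the placeholder names your_index, your_type and field from the mapping above):
curl -XGET localhost:9200/your_index/your_type/_search -d '{
  "query": {
    "prefix": {
      "field": "mi"
    }
  }
}'
The prefix query is not analyzed, so mi is compared directly against the lowercased unigrams and shingles stored in the index.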

Elasticsearch snowball in French not stemming correctly

I've seen a problem getting the same stem for related words in French. Here is an example with snowball in French:
curl -XDELETE http://localhost:9200/stacko36088193
curl -XPOST http://localhost:9200/stacko36088193 -d '
{
  "index": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "snowball",
          "language": "French"
        }
      }
    }
  }
}'
curl 'localhost:9200/stacko36088193/_analyze?pretty=1&analyzer=my_analyzer' -d 'développeur développeuse'
And look at the resulting tokens:
{
"tokens" : [ {
"token" : "développeur",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "développ",
"start_offset" : 12,
"end_offset" : 24,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
What can I do to get the same stem for all of these words?

Combine search terms automatically with Elasticsearch?

Using Elasticsearch to search our documents, we discovered that when we search for "wave board" we get no good results, because documents containing "waveboard" are not at the top of the results. Google does this kind of "term combining". Is there a simple way to do this in ES?
Found a good solution: create a custom analyzer with a shingle filter using "" as the token separator, and use that in a query (use a bool query to combine it with standard queries); see the sketch below.
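A minimal sketch of that analyzer, assuming the illustrative names my_concat and concat_shingles (the token_separator of "" is what glues adjacent tokens together, so wave board also produces the token waveboard):
"analysis": {
  "analyzer": {
    "my_concat": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "concat_shingles"]
    }
  },
  "filter": {
    "concat_shingles": {
      "type": "shingle",
      "token_separator": "",
      "output_unigrams": true
    }
  }
}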
To do this at analysis time, you can also use what is known as a "decompounding" token filter. Here is an example that decompounds the text "catdogmouse" into the tokens "cat", "dog", and "mouse":
POST /decom
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decom_filter"]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": ["cat", "dog", "mouse"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
And then you can see how they are applied to certain terms:
POST /decom/_analyze?field=body&pretty
racecatthings
{
"tokens" : [ {
"token" : "racecatthings",
"start_offset" : 1,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "cat",
"start_offset" : 1,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
And another (you should be able to extrapolate this to separate "waveboard" into "wave" and "board"):
POST /decom/_analyze?field=body&pretty
catdogmouse
{
"tokens" : [ {
"token" : "catdogmouse",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "cat",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "dog",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "mouse",
"start_offset" : 1,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
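For the original question, the extrapolation would be a word list that contains the compound parts, e.g. (a sketch; a real word list would have to cover the vocabulary of your documents):
"filter": {
  "decom_filter": {
    "type": "dictionary_decompounder",
    "word_list": ["wave", "board"]
  }
}
With that filter in place, analyzing waveboard yields the tokens waveboard, wave and board, so a search for wave board can match the document.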

Elasticsearch, how to concatenate words then ngram it?

I'd like to concatenate words then ngram it.
What's the correct setting for elasticsearch?
In English:
from: stack overflow
==> stackoverflow (concatenate first)
==> sta / tac / ack / cko / kov / ... etc. (min_gram: 3, max_gram: 10)
To do the concatenation I'm assuming that you just want to remove all spaces from your input data. To do this, you need to implement a pattern_replace char filter that replaces space with nothing.
Setting up the ngram tokenizer should be easy - just specify your token min/max lengths.
It's worth adding a lowercase token filter too - to make searching case insensitive.
curl -XPOST localhost:9200/my_index -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_new_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "char_filter": ["my_pattern"],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\u0020",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}'
Testing this:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'
gives the following (just a small part shown below):
{
"tokens" : [ {
"token" : "sta",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "stac",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
}, {
"token" : "stack",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 3
}, {
"token" : "stacko",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 4
}, {
"token" : "stackov",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 5
}, ... ]
}
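To verify the end-to-end behavior, you could search a field mapped with this analyzer (a hedged sketch: the index my_index exists from above, but the field name body and its mapping are assumptions for illustration):
curl -XGET localhost:9200/my_index/_search -d '{
  "query": {
    "match": {
      "body": "ckov"
    }
  }
}'
Since the query string is run through the same analyzer, the fragment ckov is broken into ngrams that match the indexed ngrams of stackoverflow.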
