Elasticsearch, how to concatenate words then ngram them?

I'd like to concatenate words and then ngram the result.
What are the correct settings for Elasticsearch?
In English:
from: stack overflow
==> stackoverflow : concatenate first,
==> sta / tac / ack / cko / kov / ... etc. (min_gram: 3, max_gram: 10)

To do the concatenation, I'm assuming you just want to remove all spaces from your input data. To do this, define a pattern_replace char filter that replaces spaces with nothing.
Setting up the ngram tokenizer should be easy - just specify your token min/max lengths.
It's worth adding a lowercase token filter too, to make searching case insensitive.
curl -XPOST localhost:9200/my_index -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_new_analyzer": {
          "type": "custom",
          "char_filter": ["my_pattern"],
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "\u0020",
          "replacement": ""
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}'
Testing this:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_new_analyzer&pretty' -d 'stack overflow'
gives the following (just a small part shown below):
{
  "tokens" : [ {
    "token" : "sta",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "stac",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "stack",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "stacko",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "stackov",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "word",
    "position" : 5
  }, ... ]
}
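To actually use the analyzer, reference it in your mapping and query the field as usual. A minimal sketch, assuming a hypothetical type my_type and field title (neither appears in the original question):

# hypothetical mapping: index "title" with the custom analyzer
curl -XPUT localhost:9200/my_index/my_type/_mapping -d '{
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "my_new_analyzer"
      }
    }
  }
}'

# a match query now hits on any 3-10 character ngram of the concatenated text, e.g. "ckov"
curl -XGET localhost:9200/my_index/my_type/_search -d '{
  "query": {
    "match": {
      "title": "ckov"
    }
  }
}'

Keep in mind that the same analyzer also runs on the query string at search time, so the query itself gets ngrammed; if you'd rather match the user's input verbatim, consider configuring a separate search analyzer on the field.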

Related

Removing special characters and words from a URL

I was looking for a way to generate words and special characters as tokens from a URL.
e.g. I have the URL https://www.google.com/
I want to generate tokens in Elasticsearch as https, www, google, com, :, /, /, ., ., /
You can define a custom analyzer with the letter tokenizer as shown below. (Note that the letter tokenizer splits on anything that is not a letter and discards it, so only the word parts come out as tokens, as the output below shows.)
PUT index3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email": {
          "tokenizer": "letter",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
Test API:
POST index3/_analyze
{
  "text": [
    "https://www.google.com/"
  ],
  "analyzer": "my_email"
}
Output:
{
  "tokens" : [
    {
      "token" : "https",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "www",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "google",
      "start_offset" : 12,
      "end_offset" : 18,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "com",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "word",
      "position" : 3
    }
  ]
}

Adjust word positions with the stopword filter

The stopword filter of Elasticsearch maintains word positions: https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html#maintaining-positions
For example, here is my analyzer:
"myAnalyzer": {
"tokenizer": "standard",
"filter": ["asciifolding", "french_elision", "lowercase", "french_stop", "french_stemmer"]
}
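The french_elision, french_stop and french_stemmer filters are custom and not shown in the question. For context, a typical definition, along the lines of the French example in the Elasticsearch language-analyzer docs, would be:

"filter": {
  "french_elision": {
    "type": "elision",
    "articles_case": true,
    "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c"]
  },
  "french_stop": {
    "type": "stop",
    "stopwords": "_french_"
  },
  "french_stemmer": {
    "type": "stemmer",
    "language": "light_french"
  }
}

It is the stop filter here that removes "de" and leaves the position gap the question is about.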
And here is the analysis of my string:
curl -XGET 'localhost:9200/_analyze?pretty=true' -d '{"analyzer": "myAnalyzer", "text": "instruments de musique"}'
{
  "tokens" : [ {
    "token" : "instrument",
    "start_offset" : 0,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "musiqu",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
Is there a way to adjust the positions so that the word "musique" ends up in position 1 instead of 2?
My problem comes from phrase queries.
You can try using query_string with enable_position_increments: false.
Something like:
{
  "query_string": {
    "default_field": "text",
    "query": "\"instruments de musique\"",
    "enable_position_increments": false
  }
}

Using word_delimiter with edgeNGram ignores word_delimiter tokens

I have my custom analyzer below, but I don't understand how to achieve my goal.
My goal is to have a whitespace-separated inverted index, but also an autocomplete feature once the user enters a minimum of 3 chars. For that I thought to combine the word_delimiter and edgeNGram token filters as below:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
              "standard",
              "lowercase",
              "my_word_delimiter",
              "my_edge_ngram_analyzer"
            ]
          }
        },
        "filter": {
          "my_word_delimiter": {
            "type": "word_delimiter",
            "catenate_all": true
          },
          "my_edge_ngram_analyzer": {
            "type": "edgeNGram",
            "min_gram": 3,
            "max_gram": 10
          }
        }
      }
    }
  }
}
This gives the result below for "Brother TN-200". But I was expecting "tn" to also be in the inverted index, since I have the word_delimiter filter. Why is it not in the inverted index? How can I achieve this?
curl -XGET "localhost:9200/myIndex/_analyze?analyzer=my_analyzer&pretty=true" -d "Brother TN-200"
{
  "tokens" : [ {
    "token" : "bro",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "brot",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "broth",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "brothe",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "brother",
    "start_offset" : 14,
    "end_offset" : 21,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "tn2",
    "start_offset" : 22,
    "end_offset" : 28,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "tn20",
    "start_offset" : 22,
    "end_offset" : 28,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "tn200",
    "start_offset" : 22,
    "end_offset" : 28,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "200",
    "start_offset" : 25,
    "end_offset" : 28,
    "type" : "word",
    "position" : 4
  } ]
}
UPDATE:
Of course, if I use "min_gram": 2, "tn" will be in the inverted index, but I don't want this: if any other word contains "tn" inside it, it will appear in the result list too.
Take the "hp" keyword, for example. I get products for "Hewlett Packard", as my products are named like "hp xxx", but I also get a product called "tech hpc". I don't want that product to be displayed until I type "hpc". That's the reason I set 3.
If I don't use the edgeNGram filter but only word_delimiter, "tn" is in the inverted index, as "Brother TN-200" is indexed as brother, tn and 200. That's why I expected word_delimiter to put "tn" into the inverted index. Does it have no use when combined with edgeNGram?
In my_edge_ngram_analyzer the min_gram setting is 3, so any token shorter than 3 codepoints will not show up.
You would need to set this to 2 if you want TN to show up.
Example:
GET <my_index>/_analyze?tokenizer=whitespace&filters=my_edge_ngram_analyzer&text=TN
The above call returns 0 tokens.
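If you want both behaviors - exact short tokens like tn for search, and 3+ character grams for autocomplete - one workaround (a sketch, not from the original answer; the field and analyzer names are made up) is a multi-field: index the text once with a delimiter-only analyzer and once with the edge-ngram analyzer, then query both fields. The delimiter-only analyzer is simply my_analyzer without the edge-ngram filter:

"delimiter_only_analyzer": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": ["standard", "lowercase", "my_word_delimiter"]
}

In the mapping:

"name": {
  "type": "string",
  "analyzer": "delimiter_only_analyzer",
  "fields": {
    "autocomplete": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}

And at search time:

{
  "query": {
    "multi_match": {
      "query": "tn",
      "fields": ["name", "name.autocomplete"]
    }
  }
}

This way "tn" matches via the name field while 3-character prefixes like "bro" still match via name.autocomplete.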

How to map one word to another word in Elasticsearch?

How can I map a word to another word in Elasticsearch? That is, suppose I have the following document:
{
  "carName": "Porche",
  "review": " this car is so awesome"
}
Now when I search for good/fantastic etc., it should map to "awesome".
Is there any way I can do this in Elasticsearch?
Yes, you can achieve this by using a synonym token filter.
First you need to define a new custom analyzer in your index and use that analyzer in your mapping.
curl -XPUT localhost:9200/cars -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "synonyms"
          ]
        }
      },
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "good, awesome, fantastic"
          ]
        }
      }
    }
  },
  "mappings": {
    "car": {
      "properties": {
        "carName": {
          "type": "string"
        },
        "review": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
You can add as many synonyms as you want, either in the settings directly or in a separate file that you can reference in the settings using the synonyms_path property.
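For the file-based variant, a sketch (synonyms_path is resolved relative to the Elasticsearch config directory, and the file must be present on every node; the file name here is illustrative):

"filter": {
  "synonyms": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt"
  }
}

where analysis/synonyms.txt holds one comma-separated group per line:

good, awesome, fantastic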
Then we can index your sample document above:
curl -XPUT localhost:9200/cars/car/1 -d '{
  "carName": "Porche",
  "review": " this car is so awesome"
}'
What is going to happen is that when the synonyms token filter kicks in, it will also index the tokens good and fantastic along with awesome, so that you can search for and find that document by those tokens as well. Concretely, analyzing the sentence this car is so awesome...
curl -XGET 'localhost:9200/cars/_analyze?analyzer=my_analyzer&pretty' -d 'this car is so awesome'
...will produce the following tokens (see the last three tokens)
{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "car",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "is",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "so",
    "start_offset" : 12,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "good",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "SYNONYM",
    "position" : 5
  }, {
    "token" : "awesome",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "SYNONYM",
    "position" : 5
  }, {
    "token" : "fantastic",
    "start_offset" : 15,
    "end_offset" : 22,
    "type" : "SYNONYM",
    "position" : 5
  } ]
}
Finally, you can search like this and the document will be retrieved:
curl -XGET localhost:9200/cars/car/_search?q=review:good

Combine search terms automatically with Elasticsearch?

Using Elasticsearch to search our documents, we discovered that when we search for "wave board" we get no good results, because documents containing "waveboard" are not at the top of the results. Google does this kind of "term combining". Is there a simple way to do this in ES?
Found a good solution: create a custom analyzer with a shingle filter using "" as the token separator, and use that in a query (use a bool query to combine it with standard queries); a sketch follows below.
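For reference, such an analyzer could look like the following (a sketch; the index, analyzer and filter names are made up):

PUT /compounds
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "compound_shingles"]
        }
      },
      "filter": {
        "compound_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "token_separator": "",
          "output_unigrams": true
        }
      }
    }
  }
}

Analyzing "wave board" with this analyzer produces wave, waveboard and board, so the glued-together shingle can match documents that contain the compound form.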
To do this at analysis time, you can also use what is known as a "decompounding" token filter. Here is an example that decompounds the text "catdogmouse" into the tokens "cat", "dog", and "mouse":
POST /decom
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decom_filter"]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": ["cat", "dog", "mouse"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
And then you can see how they are applied to certain terms:
POST /decom/_analyze?field=body&pretty
racecatthings
{
  "tokens" : [ {
    "token" : "racecatthings",
    "start_offset" : 1,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "cat",
    "start_offset" : 1,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
And another (you should be able to extrapolate from this to separate "waveboard" into "wave" and "board"):
POST /decom/_analyze?field=body&pretty
catdogmouse
{
  "tokens" : [ {
    "token" : "catdogmouse",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "cat",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "dog",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "mouse",
    "start_offset" : 1,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
