Elasticsearch analyzer doesn't replace the apostrophes (') - elasticsearch

Using Elasticsearch v7.0
This is the analyzer I've implemented (http://phoenyx2:9200/search_dev/_settings?pretty=true):
{
"search_dev": {
"settings": {
"index": {
"refresh_interval": "30s",
"number_of_shards": "1",
"provided_name": "search_dev",
"creation_date": "1558444846417",
"analysis": {
"analyzer": {
"my_standard": {
"filter": [
"lowercase"
],
"char_filter": [
"my_char_filter"
],
"tokenizer": "standard"
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"' => "
]
}
}
},
"number_of_replicas": "1",
"uuid": "hYz0ZlWFTDKearW1rpx8lw",
"version": {
"created": "7000099"
}
}
}
}
}
I've recreated the whole index, and there is still no change in the analysis.
I've also run this request against phoenyx2:9200/search_dev/_analyze:
{
"analyzer":"my_standard",
"field":"stakeholderName",
"text": "test't"
}
Reply was:
{
"tokens": [
{
"token": "test't",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
}
]
}
I was hoping the returned token would be "testt".

When you re-create an index, it's not enough to define a new analyzer in the settings.
You also have to specify in the mapping which fields use which analyzer, for example:
"mappings":{
"properties":{
"stakeholderName": {
"type":"text",
"analyzer":"my_standard"
}
}
}
Your mapping (probably) looks like:
"mappings":{
"properties":{
"stakeholderName": {
"type":"text"
}
}
}
Basically, if you run your "analyze" test again and drop the field:
{
"analyzer":"my_standard",
"text": "test't"
}
You'll get:
{
"token": "testt",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
}
That is what you expected. So, bad news buddy: you have to re-index all your data again, and this time specify in the mapping which analyzer you want to be used for each field. Otherwise Elasticsearch will default to its standard analyzer every time.
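To make the point above concrete, here is a minimal Python sketch of a create-index body that carries both the analysis settings from the question and a mapping that attaches my_standard to stakeholderName. It only builds and prints the JSON; actually sending it to the cluster is left out, and the helper name build_index_body is made up for illustration.

```python
import json


def build_index_body() -> dict:
    """Create-index body: analysis settings plus a mapping that uses them."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    # Same mapping char filter as in the question: drop apostrophes.
                    "my_char_filter": {"type": "mapping", "mappings": ["' => "]}
                },
                "analyzer": {
                    "my_standard": {
                        "tokenizer": "standard",
                        "char_filter": ["my_char_filter"],
                        "filter": ["lowercase"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                # Without this "analyzer" entry the field uses the default analyzer.
                "stakeholderName": {"type": "text", "analyzer": "my_standard"}
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

The analyzer name in the mapping has to match a key under settings.analysis.analyzer, which is the one thing worth checking before re-indexing.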

Related

Elasticsearch custom analyser

Is it possible to create a custom Elasticsearch analyser that splits the indexed text on a space and then creates two tokens: one with everything before the space, and a second with the whole text?
For example: I have a stored record with a field containing the text '35 G'.
Now I want to retrieve that record by querying that field with either '35' or '35 G'.
So elastic should create two tokens: ['35', '35 G'] and no more.
If it's possible, how can I achieve it?
This is doable using the path_hierarchy tokenizer.
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "path_hierarchy",
"delimiter": " "
}
}
}
}
...
}
And now
POST test/_analyze
{
"analyzer": "my_analyzer",
"text": "35 G"
}
outputs
{
"tokens": [
{
"token": "35",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "35 G",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
}
]
}
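For intuition, the behaviour shown above can be approximated in a few lines of Python. This is only a sketch of the idea (emit every delimiter-terminated prefix), not the tokenizer's actual implementation, and the function name path_hierarchy_tokens is hypothetical.

```python
from typing import List


def path_hierarchy_tokens(text: str, delimiter: str = " ") -> List[str]:
    """Return every delimiter-terminated prefix of text, shortest first."""
    parts = text.split(delimiter)
    prefixes = [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
    # Drop empty prefixes caused by a leading delimiter.
    return [p for p in prefixes if p]


print(path_hierarchy_tokens("35 G"))     # ['35', '35 G']
print(path_hierarchy_tokens("35 G XL"))  # ['35', '35 G', '35 G XL']
```

Note that a value with more than one space yields more than two tokens, so the "two tokens and no more" requirement only holds for values with a single space.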

Elasticsearch keyword tokenizer doesn't work with phonetic analyzer

I want to add a custom phonetic analyzer, and I also don't want my given string to be split into separate tokens. Suppose I have two strings:
KAMRUL ISLAM
KAMRAL ISLAM
I don't want any results for the query string KAMRUL, but I want both strings returned for the query string KAMRUL ISLAM.
For this, I have used a custom phonetic analyzer with a keyword tokenizer.
Index settings:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"dbl_metaphone": {
"tokenizer": "keyword",
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "keyword",
"filter": "dbl_metaphone"
}
}
}
}
}
Type Mappings:
PUT /my_index/_mapping/my_type
{
"properties": {
"name": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
I have inserted data with:
PUT /my_index/my_type/5
{
"name": "KAMRUL ISLAM"
}
And my query string:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "KAMRAL"
}
}
}
}
Unfortunately, I get both strings back. I am using ES 1.7.1. Is there any way to solve this?
Additionally, when I ran
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRUL ISLAM'
I got the result:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 1
}
]
}
And when running:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRAL'
I got:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 1
}
]
}

In a span_first query, can the "end" parameter be specified in terms of the actual string stored in ES, or does it have to be specified in terms of the tokens stored in ES?

I asked a previous question here (Query in Elasticsearch for retrieving strings that start with a particular word), and my problem was solved by using a span_first query. Now my problem has changed a bit: my mapping has changed because I want to store words ending in an apostrophe 's' as "word", "words" and "word's". For example:
"joseph's" -> "joseph's", "josephs", "joseph"
My mapping is given below:
curl -X PUT "http://localhost:9200/colleges/" -d
'{
"settings": {
"index": {
"analysis": {
"char_filter": {
"apostrophe_comma": {
"type": "pattern_replace",
"pattern": "\\b((\\w+)\\u0027S)\\b",
"replacement": "$1 $2s $2"
}
},
"analyzer": {
"simple_wildcard": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter" : ["apostrophe_comma"],
"filter": ["lowercase", "unique"]
}
}
}
}
},
"mappings" : {
"college": {
"properties":{
"college_name" : { "type" : "string", "index": "analyzed", "analyzer": "simple_wildcard"}
}
}
}
}'
This is the span_first query I was using:
"span_first" : {
"match" : {
"span_term" : { "college_name" : first_string[0] }
},
"end" : 1
}
The problem I am facing now is shown by the example below.
Suppose I have "Donald Duck's". If anyone searches for "Donald Duck", "Donald Duck's", "Donald Ducks", etc., I want them to get "Donald Duck's". With the span_first query this does not happen, because due to the mapping I now have 4 tokens: "Donald", "Duck", "Ducks" and "Duck's". For "Donald", the "end" used in the span_first query will be 1, and for the other three I used 2. But since "end" is different for the different tokens of the same word, I am not getting the desired result.
In short: the span_first query uses the "end" parameter to describe the position, counted from the beginning, within which my token must be present. Because of my mapping, the one word "Duck's" is broken into "Duck's", "Ducks" and "Duck", which all need a different "end" value, but while querying I can only use one "end" parameter. That's why I don't know how to get my desired output.
If any of you have worked with the span_first query, please help me.
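The position problem described above can be reproduced outside Elasticsearch with a rough Python approximation of the simple_wildcard analyzer. This is a sketch under assumptions: the regex is applied case-insensitively (the pattern in the mapping has an uppercase S), positions are counted from 0, and the function name analyze is made up.

```python
import re

# Assumption: case-insensitive match; the pattern in the mapping uses an uppercase S.
POSSESSIVE = re.compile(r"\b((\w+)'s)\b", re.IGNORECASE)


def analyze(text: str) -> list:
    """Approximate simple_wildcard: char filter, whitespace split, lowercase, unique."""
    expanded = POSSESSIVE.sub(r"\1 \2s \2", text)
    tokens = []
    for raw in expanded.split():
        token = raw.lower()
        if token not in tokens:  # "unique" filter: drop repeated tokens
            tokens.append(token)
    return tokens


for position, token in enumerate(analyze("Donald Duck's")):
    # A span_first clause can only match this token if end >= position + 1.
    print(position, token, "end >=", position + 1)
```

Each of the three variants of the possessive lands on its own position, which is why a single "end" value cannot cover all of them.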
You can use the English possessive stemmer to remove 's, and the English stemmer, which maps to the Porter stemming algorithm, to handle plurals.
POST colleges
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"simple_wildcard": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"unique",
"english_possessive_stemmer",
"light_english_stemmer"
]
}
},
"filter": {
"light_english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
}
},
"mappings": {
"college": {
"properties": {
"college_name": {
"type": "string",
"index": "analyzed",
"analyzer": "simple_wildcard"
}
}
}
}
}
After that, you have to make two requests to get the right result. First, run the user query through the analyze API to get the tokens that you will pass to the span queries.
GET colleges/_analyze
{
"text" : "donald ducks duck's",
"analyzer" : "simple_wildcard"
}
The output is the list of tokens that is passed to the next phase, i.e. the span query.
{
"tokens": [
{
"token": "donald",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "duck",
"start_offset": 7,
"end_offset": 12,
"type": "word",
"position": 1
},
{
"token": "duck",
"start_offset": 13,
"end_offset": 19,
"type": "word",
"position": 2
}
]
}
The tokens donald, duck and duck are passed with end positions 1, 2 and 3 respectively.
NOTE: No stemming algorithm is 100% accurate; you might miss some singular/plural combinations. To handle those, you could log your queries and then use either a synonym token filter or a mapping char filter.
Hope this solves the problem.
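As a rough illustration of the second step, the Python sketch below turns the tokens from the _analyze response above into one span_first clause per token, with end set to position + 1. How the clauses are then combined into a single query is not spelled out in the answer, so that part is left open; the helper name span_first_clauses is hypothetical.

```python
def span_first_clauses(field, tokens):
    """One span_first clause per analyzed token, with end = position + 1."""
    return [
        {
            "span_first": {
                "match": {"span_term": {field: tok["token"]}},
                "end": tok["position"] + 1,
            }
        }
        for tok in tokens
    ]


# Tokens copied from the _analyze response (offsets and types omitted).
tokens = [
    {"token": "donald", "position": 0},
    {"token": "duck", "position": 1},
    {"token": "duck", "position": 2},
]
clauses = span_first_clauses("college_name", tokens)
print([c["span_first"]["end"] for c in clauses])  # [1, 2, 3]
```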

Phonetic search results for integers with Elasticsearch

Forgive me as I am new to Elasticsearch, but I am following the Phonetic start guide found here: Phonetic Matching
I have the following
POST /app
{
"settings": {
"index": {
"analysis": {
"filter": {
"dbl_metaphone": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "standard",
"filter": "dbl_metaphone"
}
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
},
"year": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
}
}
} }
I add some documents by doing:
POST /app/movie
{ "title": "300", "year": "2006" } & { "title": "500 days of summer", "year": "2009" }
I want to find the movie '300' by entering this query, though:
POST /app/movie/_search
{
"query": {
"match": {
"title.phonetic": {
"query": "three hundred"
}
}
}
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
"tokens": [
{
"token": "300",
"start_offset": 0,
"end_offset": 3,
"type": "<NUM>",
"position": 0
}
]
}
I see that only a number token is returned, not an alphanumeric version like:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
"tokens": [
{
"token": "0R",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "TR",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "HNTR",
"start_offset": 6,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Is there something I am missing in my phonetic setup that I need to define to get both the numerical and alphanumeric tokens?
That is not possible. Double Metaphone is a phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This makes it possible to search for terms, such as names, that may be spelt differently but sound the same.
As you can see from the algorithm, Double Metaphone ignores numbers/numeric characters.
You can read more about double metaphone here.
A better case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
Shouldn't be too hard to find/write a function that converts integers to English.
Call your function when constructing your document to ingest into ES.
Alternately, write it in Groovy, and call it as a Transform script in your mapping.
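A minimal Python sketch of the index-time approach described above might look like this. It covers 0 to 999,999 only, and the extra field name title_spelled as well as the helper names are assumptions made for illustration, not part of the original mapping.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999 in English."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] if rest == 0 else f"{TENS[tens]} {ONES[rest]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} {number_to_words(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{number_to_words(thousands)} thousand"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def spell_out_numbers(text: str) -> str:
    """Replace each whitespace-separated integer in text with its English form."""
    words = []
    for tok in text.split():
        if tok.isascii() and tok.isdigit() and int(tok) < 1_000_000:
            words.append(number_to_words(int(tok)))
        else:
            words.append(tok)
    return " ".join(words)


def with_spelled_title(doc: dict) -> dict:
    """Copy of doc with a hypothetical title_spelled field added before indexing."""
    return {**doc, "title_spelled": spell_out_numbers(doc["title"])}


print(with_spelled_title({"title": "300", "year": "2006"}))
```

The spelled-out field could then be mapped with the phonetic analyzer, so that a query for "three hundred" has alphabetic tokens to match against.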

Elasticsearch custom analyzer for hyphens, underscores, and numbers

Admittedly, I'm not that well versed on the analysis part of ES. Here's the index layout:
{
"mappings": {
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "my_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "word_delimiter",
"preserve_original": true
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase", "my_filter"]
}
}
}
}
}
You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":
{
"query": {
"match": {
"hostname": "WIN_1"
}
}
}
The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.
{
"tokens": [
{
"token": "win_1",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 1
},
{
"token": "win",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "1",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
}
]
}
What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in its name. Below is some test data.
{
"ipaddress": "192.168.1.253",
"hostname": "WIN_8_ENT_1"
}
{
"ipaddress": "10.0.0.1",
"hostname": "server1"
}
{
"ipaddress": "172.20.10.36",
"hostname": "ServA-1"
}
Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't very good with examples.
You could change your analysis to use a pattern analyzer that discards the digits and underscores:
{
"analysis": {
"analyzer": {
"word_only": {
"type": "pattern",
"pattern": "([^\\p{L}]+)"
}
}
}
}
Using the analyze API:
curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'
returns:
"tokens" : [ {
"token" : "win",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "ent",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 2
} ]
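To preview what this analyzer would do to the three test hostnames from the question without touching a cluster, here is a small Python approximation. Python's re module has no \p{L}, so the sketch matches runs of letters with [^\W\d_] instead, which is close but not guaranteed to be identical; the function name word_only simply mirrors the analyzer name.

```python
import re

# Letters only: word characters minus digits and underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def word_only(text: str) -> list:
    """Keep runs of letters, lowercased, and discard everything else."""
    return [run.lower() for run in LETTER_RUN.findall(text)]


for hostname in ["WIN_8_ENT_1", "server1", "ServA-1"]:
    print(hostname, "->", word_only(hostname))
```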
Your mapping would become:
{
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "word_only",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
You can use a multi_match query to get the results you want:
{
"query": {
"multi_match": {
"fields": [
"hostname",
"hostname.raw"
],
"query": "WIN_1"
}
}
}
Here's the analyzer and queries I ended up with:
{
"mappings": {
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "hostname_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"hostname_filter": {
"type": "pattern_capture",
"preserve_original": 0,
"patterns": [
"(\\p{Ll}{3,})"
]
}
},
"analyzer": {
"hostname_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [ "lowercase", "hostname_filter" ]
}
}
}
}
}
Queries:
Find host name starting with:
{
"query": {
"prefix": {
"hostname.raw": "WIN_8"
}
}
}
Find host name containing:
{
"query": {
"multi_match": {
"fields": [
"hostname",
"hostname.raw"
],
"query": "WIN"
}
}
}
Thanks to Dan for getting me in the right direction.
When ES 1.4 is released, there will be a new filter called 'keep types' that will allow you to keep only certain token types once the string is tokenized (i.e. keep words only, numbers only, etc.).
Check it out here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter
This may be a more convenient solution for your needs in the future.
It looks like you want to apply two different types of searches on your hostname field. One for exact matches, and one for a variation of wildcard (maybe in your specific case, a prefix query).
After trying to implement all types of different searches using several different analyzers, I've found it sometimes simpler to add another field to represent each type of search you want to do. Is there a reason you do not want to add another field like the following:
{
"ipaddress": "192.168.1.253",
"hostname": "WIN_8_ENT_1",
"system": "WIN"
}
Otherwise, you could consider writing your own custom filter that does effectively the same thing under the hood. Your filter will read in your hostname field and index the exact keyword and a substring that matches your stemming pattern (e.g. WIN in WIN_8_ENT_1).
I do not think there is any existing analyzer/filter combination that can do what you are looking for, assuming I have understood your requirements correctly.
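If the extra field route suggested above is taken, the value can be derived when the document is built. The Python sketch below assumes that the "system" is simply the leading run of ASCII letters in the hostname, which fits WIN_8_ENT_1 but may not fit every naming scheme; the helper name add_system_field is hypothetical.

```python
import re

# Assumption: the "system" is the leading run of ASCII letters in the hostname.
LEADING_LETTERS = re.compile(r"[A-Za-z]+")


def add_system_field(doc: dict) -> dict:
    """Return a copy of doc with a derived "system" field, if one can be found."""
    match = LEADING_LETTERS.match(doc["hostname"])
    if match is None:
        return dict(doc)  # hostname starts with a non-letter: leave it unchanged
    return {**doc, "system": match.group(0)}


print(add_system_field({"ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1"}))
```

Exact-match and prefix searches can then target hostname.raw, while broad "all WIN hosts" searches target the new field.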

Resources