Elasticsearch keyword tokenizer doesn't work with phonetic analyzer

I want to add a custom phonetic analyzer, but I don't want the input string to be split into tokens. Suppose I have two strings:
KAMRUL ISLAM
KAMRAL ISLAM
I don't want any results for the query string KAMRUL, but I want both strings returned for the query string KAMRUL ISLAM.
To do this, I defined a custom phonetic analyzer with a keyword tokenizer.
Index settings:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"dbl_metaphone": {
"tokenizer": "keyword",
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "keyword",
"filter": "dbl_metaphone"
}
}
}
}
}
Type Mappings:
PUT /my_index/_mapping/my_type
{
"properties": {
"name": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
I inserted data with:
PUT /my_index/my_type/5
{
"name": "KAMRUL ISLAM"
}
And my query:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "KAMRAL"
}
}
}
}
Unfortunately, the query returns both strings. I am using ES 1.7.1. Is there any way to solve this?
Additionally, when I run
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRUL ISLAM'
I get this result:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 1
}
]
}
And when I run
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRAL'
I get:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 1
}
]
}
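One possible explanation for the identical tokens, offered as an assumption rather than something confirmed in this thread: the phonetic plugin documents a max_code_len setting for the double_metaphone encoder, with a default of 4 in the versions of the documentation I am aware of, so a longer code would be cut to its first four characters. The sketch below models only that truncation step in Python. It does not implement Double Metaphone, and the untruncated code "KMRLSLM" is a hypothetical illustration, not output from Elasticsearch.

```python
# Toy model of code-length truncation; it does NOT implement Double Metaphone.
def truncate_code(code, max_code_len=4):
    """Keep only the first max_code_len characters of a phonetic code."""
    return code[:max_code_len]

# Hypothetical untruncated code for "KAMRUL ISLAM" (an assumption for illustration).
full_name_code = "KMRLSLM"
# Code reported by _analyze for "KAMRAL" in the question.
first_name_code = "KMRL"

print(truncate_code(full_name_code))
print(truncate_code(first_name_code))
print(truncate_code(full_name_code, max_code_len=10))
```

If truncation is the cause, max_code_len in the filter definition would be the setting to experiment with.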

Related

Elastic search filter not working for multi value json

Hi My elastic search index has the mapping as.
"userId": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"userId": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"userName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
Also my search query looks like this
GET : http://localhost:5000/questions/_search
Body is
{
"query": {
"bool": {
"filter": [
{ "term": { "userId.userId": "testuser#demo.com"
}}
]
}
}
}
I am always getting 0 hits. Is there a better way to query multi-value JSON?
The userId.userId field is of type text. If no analyzer is defined, Elasticsearch uses the standard analyzer by default, which tokenizes testuser#demo.com into
{
"tokens": [
{
"token": "testuser",
"start_offset": 0,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "demo.com",
"start_offset": 9,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 1
}
]
}
You need to query the "userId.userId.keyword" sub-field instead of userId.userId (notice the ".keyword" suffix). A keyword field is not analyzed, so it stores the whole value as a single term.
You are getting 0 hits because the term query looks for an exactly matching term and does not analyze the search string. Since userId.userId was indexed with the standard analyzer (the default), the index holds only testuser and demo.com, and neither equals testuser#demo.com.
{
"query": {
"bool": {
"filter": [
{
"term": {
"userId.userId.keyword": "testuser#demo.com"
}
}
]
}
}
}
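The mismatch can be reproduced outside Elasticsearch. This is a rough sketch in Python: toy_standard_analyze is a hypothetical stand-in for the standard analyzer (a regex approximation, not the real Unicode segmentation rules), and term_matches only models the exact comparison a term query performs.

```python
import re

def toy_standard_analyze(text):
    """Rough approximation: lowercase, keep letter/digit runs joined by dots."""
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

def term_matches(indexed_terms, query_term):
    """A term query compares the query string to indexed terms verbatim."""
    return query_term in indexed_terms

value = "testuser#demo.com"
text_field_terms = toy_standard_analyze(value)  # what the text field indexes
keyword_field_terms = [value]                   # keyword sub-field keeps the whole value

print(text_field_terms)
print(term_matches(text_field_terms, value))
print(term_matches(keyword_field_terms, value))
```

The text field only ever sees the two fragments, so the full address can match only on the keyword sub-field.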
If you want to search for multiple values, use the terms query
{
"query": {
"bool": {
"filter": [
{
"terms": {
"userId.userId.keyword": [
"testuser#demo.com",
"abc.com"
]
}
}
]
}
}
}
Update 1:
You can use the must_not clause along with the term query to get all records that have userId not equal to testuser#demo.com
{
"query": {
"bool": {
"must_not": {
"term": {
"userId.userId.keyword": "testuser#demo.com"
}
}
}
}
}
The terms query returns documents that contain one or more exact terms in a provided field. It is the same as the term query, except that you can search for multiple values.
{
"query": {
"terms": {
"userId.userId.keyword": [ "testuser#demo.com", "other#demo.com" ],
"boost": 1.0
}
}
}

Elasticsearch analyzer doesn't replace the apostrophes (')

Using Elasticsearch v7.0
This is the analyzer I've implemented (http://phoenyx2:9200/search_dev/_settings?pretty=true):
{
"search_dev": {
"settings": {
"index": {
"refresh_interval": "30s",
"number_of_shards": "1",
"provided_name": "search_dev",
"creation_date": "1558444846417",
"analysis": {
"analyzer": {
"my_standard": {
"filter": [
"lowercase"
],
"char_filter": [
"my_char_filter"
],
"tokenizer": "standard"
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"' => "
]
}
}
},
"number_of_replicas": "1",
"uuid": "hYz0ZlWFTDKearW1rpx8lw",
"version": {
"created": "7000099"
}
}
}
}
}
I've recreated the whole index, and there is still no change in the analysis.
I've also run this against phoenyx2:9200/search_dev/_analyze:
{
"analyzer":"my_standard",
"field":"stakeholderName",
"text": "test't"
}
Reply was:
{
"tokens": [
{
"token": "test't",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
}
]
}
I was hoping the returned token would be "testt".
When you re-create an index, it's not enough to define a new analyzer in the settings.
You also have to specify in the mapping which fields use which analyzer, for example:
"mappings":{
"properties":{
"stakeholderName": {
"type":"text",
"analyzer":"my_standard"
}
}
}
Your mapping (probably) looks like:
"mappings":{
"properties":{
"stakeholderName": {
"type":"text"
}
}
}
Basically, if you run your "analyze" test again and drop the field:
{
"analyzer":"my_standard",
"text": "test't"
}
You'll get:
{
"token": "testt",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
}
That is the token you expect. The bad news is that you have to re-index all your data, this time specifying in the mapping which analyzer each field should use; otherwise Elasticsearch falls back to the standard analyzer every time.
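To see why the analyzer choice changes the token, here is a minimal sketch of the pipeline order in a custom analyzer: char filter first, then tokenizer, then token filters. The Python functions are hypothetical stand-ins, and toy_tokenize is a crude regex, not the real standard tokenizer.

```python
import re

def my_char_filter(text):
    """Mimics the mapping char filter "' => ": delete every apostrophe."""
    return text.replace("'", "")

def toy_tokenize(text):
    """Crude stand-in for the standard tokenizer: runs of letters, digits, apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def my_standard(text):
    """Char filter, then tokenizer, then the lowercase token filter."""
    return [token.lower() for token in toy_tokenize(my_char_filter(text))]

def without_char_filter(text):
    """Same pipeline minus the char filter, as a field without the custom analyzer would see it."""
    return [token.lower() for token in toy_tokenize(text)]

print(my_standard("test't"))
print(without_char_filter("test't"))
```

The apostrophe is gone before tokenization only when the custom analyzer is the one actually applied to the field.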

No results returned for filtered Elasticsearch query

I'm having trouble executing the following request against Elasticsearch v2.2.0. If I remove the filter property (and contents, of course), I get my entity back (only one exists). With the filter clause in place, I just get 0 results, but no error. Same if I remove the email filter and/or the name filter. Am I doing something wrong with this request?
Request
GET http://localhost:9200/my-app/my-entity/_search?pretty=1
{
"query": {
"filtered" : {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"term": {
"email": "my.email#email.com"
}
},
{
"term": {
"name": "Test1"
}
}
]
}
}
}
}
Existing Entity
{
"email": "my.email#email.com",
"name": "Test1"
}
Mapping
"properties": {
"name": {
"type": "string"
},
"email": {
"type": "string"
},
"term": {
"type": "long"
}
}
Since the email field is analyzed and no custom analyzer is set, the standard analyzer is applied and splits the value into tokens.
Read about Standard Tokenizer here.
You can use the command below to see how my.email#email.com is tokenized.
curl -XGET "http://localhost:9200/_analyze?tokenizer=standard" -d "my.email#email.com"
This generates the following output.
{
"tokens": [
{
"token": "my.email", ===> Notice this
"start_offset": 0,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "email.com", ===> Notice this
"start_offset": 9,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
}
]
}
If you want exact matching on the full value, you need to make the field not_analyzed. Read about how to create a not_analyzed field here.
{
"email": {
"type": "string",
"index": "not_analyzed"
}
}
Hope it is clear
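The name filter fails for a related reason: the standard analyzer also lowercases tokens, so the index holds test1 while the term filter looks for Test1. A small sketch of both effects in Python, assuming a regex approximation of the analyzer (toy_standard_analyze and term_filter are hypothetical helpers, not Elasticsearch APIs):

```python
import re

def toy_standard_analyze(text):
    """Rough approximation: lowercase, keep letter/digit runs joined by dots."""
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

doc = {"email": "my.email#email.com", "name": "Test1"}

# Terms stored for each field under the two mapping choices.
analyzed_index = {field: toy_standard_analyze(value) for field, value in doc.items()}
not_analyzed_index = {field: [value] for field, value in doc.items()}

def term_filter(index, field, term):
    """Exact, case-sensitive lookup, like a term filter."""
    return term in index[field]

print(analyzed_index)
print(term_filter(analyzed_index, "name", "Test1"))
print(term_filter(analyzed_index, "name", "test1"))
print(term_filter(not_analyzed_index, "email", "my.email#email.com"))
```

With the analyzed mapping, a term filter would have to use the lowercased fragments; with not_analyzed, the original strings match as written.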

Phonetic search results for integers with Elasticsearch

Forgive me as I am new to Elasticsearch, but I am following the Phonetic start guide found here: Phonetic Matching
I have the following
POST /app
{
"settings": {
"index": {
"analysis": {
"filter": {
"dbl_metaphone": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "standard",
"filter": "dbl_metaphone"
}
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
},
"year": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
}
}
} }
I add some results by doing:
POST /app/movie
{ "title": "300", "year": "2006" } & { "title": "500 days of summer", "year": "2009" }
I want to query for the movie '300' by entering this query though:
POST /app/movie/_search
{
"query": {
"match": {
"title.phonetic": {
"query": "three hundred"
}
}
}
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
"tokens": [
{
"token": "300",
"start_offset": 0,
"end_offset": 3,
"type": "<NUM>",
"position": 0
}
]
}
I see that only a numeric token is returned, not an alphanumeric one like this:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
"tokens": [
{
"token": "0R",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "TR",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "HNTR",
"start_offset": 6,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Is there something that I am missing with my phonetic query that I am supposed to define to get both the numerical and alphanumeric tokens?
That is not possible. Double Metaphone is a form of phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This makes it easier to search for terms, such as names, that are spelled differently but sound the same.
As you can see from the algorithm, Double Metaphone ignores digits and other numeric characters.
You can read more about double metaphone here.
A better case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
Shouldn't be too hard to find/write a function that converts integers to English.
Call your function when constructing your document to ingest into ES.
Alternately, write it in Groovy, and call it as a Transform script in your mapping.
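As a rough illustration of the index-time approach, here is a sketch in Python. The function names and the extra title_spoken field are hypothetical; the field would need its own phonetic mapping, and the converter only covers 0 to 999,999.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def int_to_english(n):
    """Spell out an integer between 0 and 999,999 in English words."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + int_to_english(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return int_to_english(thousands) + " thousand" + (" " + int_to_english(rest) if rest else "")

def build_movie_doc(title, year):
    """Build the document, adding a hypothetical title_spoken field with numbers spelled out."""
    words = []
    for word in title.split():
        if word.isascii() and word.isdigit() and int(word) < 1_000_000:
            words.append(int_to_english(int(word)))
        else:
            words.append(word)
    return {"title": title, "title_spoken": " ".join(words), "year": year}

print(build_movie_doc("300", "2006"))
print(build_movie_doc("500 days of summer", "2009"))
```

Searching the spelled-out field with "three hundred" then has matching text to work with, while the original title stays searchable as "300".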

Elasticsearch custom analyzer for hyphens, underscores, and numbers

Admittedly, I'm not that well versed on the analysis part of ES. Here's the index layout:
{
"mappings": {
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "my_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "word_delimiter",
"preserve_original": true
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase", "my_filter"]
}
}
}
}
}
You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":
{
"query": {
"match": {
"hostname": "WIN_1"
}
}
}
The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.
{
"tokens": [
{
"token": "win_1",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 1
},
{
"token": "win",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "1",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
}
]
}
What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in its name. Below is some test data.
{
"ipaddress": "192.168.1.253",
"hostname": "WIN_8_ENT_1"
}
{
"ipaddress": "10.0.0.1",
"hostname": "server1"
}
{
"ipaddress": "172.20.10.36",
"hostname": "ServA-1"
}
Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't very good with examples.
You could change your analysis to use a pattern analyzer that discards the digits and underscores:
{
"analysis": {
"analyzer": {
"word_only": {
"type": "pattern",
"pattern": "([^\\p{L}]+)"
}
}
}
}
Using the analyze API:
curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'
returns:
"tokens" : [ {
"token" : "win",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "ent",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 2
} ]
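The effect of that pattern on the test data can be approximated in Python. This is only a sketch: Python's re module has no \p{L}, so the regex below uses [\W\d_]+ as a stand-in for "one or more non-letters", and word_only is a hypothetical helper named after the analyzer.

```python
import re

# Runs of non-word characters, digits or underscores, i.e. anything that is not a letter.
NON_LETTERS = re.compile(r"[\W\d_]+")

def word_only(text):
    """Lowercase the text, split on non-letter runs, drop empty pieces."""
    return [piece for piece in NON_LETTERS.split(text.lower()) if piece]

for hostname in ["WIN_8_ENT_1", "server1", "ServA-1", "WIN_1"]:
    print(hostname, word_only(hostname))
```

Note that WIN_1 is reduced to the single token win, so a match query on the analyzed field cannot distinguish it from WIN; the exact form survives only in the raw sub-field.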
Your mapping would become:
{
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "word_only",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
You can use a multi_match query to get the results you want:
{
"query": {
"multi_match": {
"fields": [
"hostname",
"hostname.raw"
],
"query": "WIN_1"
}
}
}
Here's the analyzer and queries I ended up with:
{
"mappings": {
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "hostname_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"hostname_filter": {
"type": "pattern_capture",
"preserve_original": 0,
"patterns": [
"(\\p{Ll}{3,})"
]
}
},
"analyzer": {
"hostname_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [ "lowercase", "hostname_filter" ]
}
}
}
}
}
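For comparison, a sketch of what this analyzer chain keeps, again in Python and with assumptions: [a-z]{3,} is an ASCII approximation of \p{Ll}{3,}, hostname_analyzer is a hypothetical helper, DB_2_PROD is a made-up hostname, and the real filter's handling of tokens with no match at all is not modelled here.

```python
import re

# ASCII approximation of (\p{Ll}{3,}): runs of three or more lowercase letters.
CAPTURE = re.compile(r"[a-z]{3,}")

def hostname_analyzer(text):
    """Whitespace tokenizer, then lowercase, then keep only the captured letter runs."""
    tokens = []
    for raw in text.split():                         # whitespace tokenizer
        tokens.extend(CAPTURE.findall(raw.lower()))  # lowercase + capture
    return tokens

for hostname in ["WIN_8_ENT_1", "server1", "ServA-1", "DB_2_PROD"]:
    print(hostname, hostname_analyzer(hostname))
```

Unlike a plain split on non-letters, the {3,} quantifier also drops letter runs shorter than three characters, such as the db in the made-up hostname.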
Queries:
Find host name starting with:
{
"query": {
"prefix": {
"hostname.raw": "WIN_8"
}
}
}
Find host name containing:
{
"query": {
"multi_match": {
"fields": [
"hostname",
"hostname.raw"
],
"query": "WIN"
}
}
}
Thanks to Dan for getting me in the right direction.
When ES 1.4 is released, there will be a new filter called 'keep types' that will allow you to only keep certain types once the string is tokenized. (i.e. keep words only, numbers only, etc).
Check it out here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter
This may be a more convenient solution for your needs in the future.
It looks like you want to apply two different types of searches on your hostname field. One for exact matches, and one for a variation of wildcard (maybe in your specific case, a prefix query).
After trying to implement all types of different searches using several different analyzers, I've found it sometimes simpler to add another field to represent each type of search you want to do. Is there a reason you do not want to add another field like the following:
{
"ipaddress": "192.168.1.253",
"hostname": "WIN_8_ENT_1",
"system": "WIN"
}
Otherwise, you could consider writing your own custom filter that does effectively the same thing under the hood. Your filter will read in your hostname field and index the exact keyword and a substring that matches your stemming pattern (e.g. WIN in WIN_8_ENT_1).
I do not think there is any existing analyzer/filter combination that can do what you are looking for, assuming I have understood your requirements correctly.
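If the extra-field route is acceptable, the field could be filled in by the indexing code. A minimal sketch in Python, assuming the system name is simply the leading run of letters in the hostname; that rule and the add_system_field helper are illustrative assumptions, not something stated in the question.

```python
import re

# Assumed rule: the "system" is the leading run of ASCII letters in the hostname.
LEADING_LETTERS = re.compile(r"[A-Za-z]+")

def add_system_field(doc):
    """Return a copy of doc with a 'system' field derived from the hostname."""
    enriched = dict(doc)
    match = LEADING_LETTERS.match(doc["hostname"])
    if match:
        enriched["system"] = match.group(0).upper()
    return enriched

print(add_system_field({"ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1"}))
print(add_system_field({"ipaddress": "10.0.0.1", "hostname": "server1"}))
```

The derived field can then be mapped as not_analyzed and queried with a plain term filter, leaving hostname.raw for exact and prefix lookups.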
