Phonetic search results for integers with Elasticsearch

Forgive me as I am new to Elasticsearch, but I am following the phonetic matching guide found here: Phonetic Matching
I have the following index definition:
POST /app
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "dbl_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone"
          }
        },
        "analyzer": {
          "dbl_metaphone": {
            "tokenizer": "standard",
            "filter": "dbl_metaphone"
          }
        }
      }
    }
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        },
        "year": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        }
      }
    }
  }
}
I add some results by doing:
POST /app/movie
{ "title": "300", "year": 2006"} & { "title":"500 days of summer", "year": "2009" }
I want to query for the movie '300', though, by entering this query:
POST /app/movie/_search
{
  "query": {
    "match": {
      "title.phonetic": {
        "query": "three hundred"
      }
    }
  }
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
  "tokens": [
    {
      "token": "300",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
I see that only a number token is returned, not an alphanumeric/phonetic version like I get for:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
  "tokens": [
    {
      "token": "0R",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "TR",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "HNTR",
      "start_offset": 6,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Is there something that I am missing with my phonetic query that I am supposed to define to get both the numerical and alphanumeric tokens?

That is not possible. Double Metaphone is a phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This makes it possible to search for terms, such as names, that could be spelled differently but sound the same.
As you can see, the Double Metaphone algorithm ignores numbers/numeric characters.
You can read more about double metaphone here.
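For example, running two different spellings of the same surname through the dbl_metaphone analyzer defined above should produce the same phonetic keys (a quick check you can run yourself; the exact codes returned don't matter, only that they are identical for both spellings):
GET /app/_analyze?analyzer=dbl_metaphone&text=Smith
GET /app/_analyze?analyzer=dbl_metaphone&text=Smyth
Both calls should return the same tokens, which is why differently spelled but similar-sounding terms match each other, while a purely numeric input such as 300 is passed through unchanged and never gets a phonetic key.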

A better use case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
Shouldn't be too hard to find/write a function that converts integers to English.
Call your function when constructing your document to ingest into ES.
Alternately, write it in Groovy, and call it as a Transform script in your mapping.
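A minimal sketch of the index-time approach (the title_spoken field name and its value are made up for illustration; it relies on dynamic mapping here rather than a dedicated phonetic sub-field):
POST /app/movie
{ "title": "300", "title_spoken": "three hundred", "year": "2006" }

POST /app/movie/_search
{
  "query": {
    "multi_match": {
      "query": "three hundred",
      "fields": [ "title", "title_spoken" ]
    }
  }
}
If you also want misspellings of the spelled-out form to match, map title_spoken with the same phonetic multi-field used for title.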

Related

How to optimize elasticsearch's full text search to match strings like 'C++'

We have a search engine for text content which contains strings like c++ or c#. The switch to Elasticsearch has shown that the search does not match on terms like 'c++'; the ++ is removed.
How can we teach Elasticsearch to match correctly in a full-text search and not remove special characters? Characters like the comma , should of course still be removed.
You need to create your own custom analyzer that generates tokens according to your requirements. For your example I created the custom analyzer below, with a text field named language, and indexed some sample docs:
Index creation with a custom analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "char_filter": [
            "replace_comma"
          ]
        }
      },
      "char_filter": {
        "replace_comma": {
          "type": "mapping",
          "mappings": [
            ", => \\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "language": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Tokens generated for text like c++, c# and c,java.
POST http://{{hostname}}:{{port}}/{{index}}/_analyze
{
  "text": "c#",
  "analyzer": "my_analyzer"
}

{
  "tokens": [
    {
      "token": "c#",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
For c,java it generates two separate tokens, c and java, as it replaces , with whitespace, as shown below:
{
  "text": "c, java",
  "analyzer": "my_analyzer"
}

{
  "tokens": [
    {
      "token": "c",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "java",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    }
  ]
}
Note: You need to understand the analysis process and modify your custom analyzer accordingly so it works for all of your use cases. My example might not cover every edge case, but hopefully it gives you an idea of how to handle such requirements.
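As a quick end-to-end check (a sketch; the document ID and field values are made up), you can index a doc and run a match query. The query text is analyzed with the same my_analyzer, so the ++ is kept:
PUT http://{{hostname}}:{{port}}/{{index}}/_doc/1
{
  "language": "c++ java"
}

POST http://{{hostname}}:{{port}}/{{index}}/_search
{
  "query": {
    "match": {
      "language": "c++"
    }
  }
}
Note that this analyzer has no lowercase filter, so matching is case-sensitive as configured; add a lowercase filter to the chain if you need case-insensitive search.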

how to search a document containing a substring

I have the following document with this (partial) mapping:
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
I'm trying to perform a query for documents containing "success":"0" through the following DSL query:
{
  "query": {
    "bool": {
      "must": {
        "regexp": {
          "message": ".*\"success\".*0.*"
        }
      }
    }
  }
}
but I don't get any result, whereas if I perform the following DSL:
{
  "query": {
    "bool": {
      "must": {
        "regexp": {
          "message": ".*\"success\""
        }
      }
    }
  }
}
some documents are returned, e.g.:
{"data":"[{\"appVersion\":\"1.1.1\",\"installationId\":\"any-ubst-id\",\"platform\":\"aaa\",\"brand\":\"Dalvik\",\"screenSize\":\"xhdpi\"}]","executionTime":"0","flags":"0","method":"aaa","service":"myService","success":"0","type":"aservice","version":"1"}
What's wrong with my query?
The text field message uses the standard analyzer, which tokenizes the input string and converts it into tokens.
If we analyze the string "success":"0" with the standard analyzer, for example with the request below, we get these tokens:
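(A sketch of that _analyze call; the analyzer is built in, so no index is needed, and the offsets in the output depend on the whitespace around the string:)
POST _analyze
{
  "analyzer": "standard",
  "text": "\"success\":\"0\""
}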
{
  "tokens": [
    {
      "token": "success",
      "start_offset": 2,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "0",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 1
    }
  ]
}
So you can see that the colon, double quotes, etc. are removed. And since the regexp query is applied to each token, it will not match your query.
But if we use message.keyword, which has the keyword field type, the string is not analyzed and is kept as it is:
{
  "tokens": [
    {
      "token": """ "success":"0" """,
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    }
  ]
}
So if we use the below query, it should work:
{
  "query": {
    "regexp": {
      "message.keyword": """.*"success".*0.*"""
    }
  }
}
But another problem is that you have set the message.keyword field to "ignore_above": 256, so this field will ignore any string longer than 256 characters.
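If your messages can be longer than that, one option (a sketch; the index name is a placeholder, the new limit is arbitrary, and documents indexed before the change may need reindexing) is to raise ignore_above on the keyword sub-field:
PUT /my-index/_mapping
{
  "properties": {
    "message": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 8191
        }
      }
    }
  }
}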

Elasticsearch custom analyser

Is it possible to create a custom Elasticsearch analyzer which splits the input on a space and then creates two tokens: one with everything before the space and a second with everything?
For example: I have a stored record with a field that has the following text: '35 G'.
Now I want to retrieve that record by typing only '35' or '35 G' as a query on that field.
So Elasticsearch should create exactly two tokens, ['35', '35 G'], and no more.
If it's possible, how can I achieve it?
Doable using path_hierarchy tokenizer.
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": " "
        }
      }
    }
  }
  ...
}
And now
POST test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "35 G"
}
outputs
{
  "tokens": [
    {
      "token": "35",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "35 G",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
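An end-to-end check might look like this (a sketch; the field name my_field and the mapping that assigns it my_analyzer are assumed, since the mappings section is elided above):
PUT test/_doc/1
{
  "my_field": "35 G"
}

GET test/_search
{
  "query": {
    "match": {
      "my_field": "35"
    }
  }
}
A query for '35 G' works the same way, since it analyzes to the tokens '35' and '35 G', both of which are in the index.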

Elasticsearch: FVH highlights with multiple pre and post tags marking tokens incorrectly?

I'm querying my index using a boolean query with two match terms. For each term I have a separate set of pre- and post-tags. Using highlights, I would like to obtain the documents in which both terms exist and see which tokens were matched as each of them. The index contains documents in Polish analyzed using morfologik. Let's call the two terms I'm searching for aspect and feature. I want to query the index and retrieve the documents in which both a specific aspect and feature exist, and I want the highlight feature to mark the aspect token with an <aspect> tag and the feature with a <feature> tag. Most of the time it works as expected; sometimes, though, Elasticsearch marks one or both of the tokens incorrectly. I'll give you an example.
So let's say my index contains the following document:
"Najlepsza maseczka na zniszczone włosy!"
If I search for "maseczka" (aspect) and "dobry" (feature) I expect the output to be like this:
"<feature>Najlepsza</feature> <aspect>maseczka</aspect> na zniszczone włosy! "
For some reason the results from Elasticsearch are like this:
"<aspect>Najlepsza</aspect> <aspect>maseczka</aspect> na zniszczone włosy! "
What I know so far:
I thought maybe the aspect and feature have a similar form when analyzed, but that's not the case; for example, _analyze for the example above returns:
#query
GET my_index/_analyze
{
  "analyzer": "morfologik",
  "text": "dobra maseczka"
}
#results
{
  "tokens": [
    {
      "token": "dobra",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "dobro",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "dobry",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "maseczka",
      "start_offset": 6,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
# Analysis of the document:
GET my_index/_analyze
{
  "analyzer": "morfologik",
  "text": "Najlepsza maseczka na zniszczone włosy"
}
# response
{
  "tokens": [
    {
      "token": "dobry",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "maseczka",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "na",
      "start_offset": 19,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    ...
  ]
}
It's also not a problem with a specific aspect or feature, because for some queries the index will return both correctly and incorrectly highlighted documents (so I'd expect it to be a problem with documents rather than queries).
In some cases both terms are highlighted as aspects; in others the aspect is marked as a feature and the feature as an aspect. I haven't found any rule so far.
I thought that if my search terms match the order of the highlight tags, the first term should always get the first tag and the second term always the second tag, but maybe they work in a different way? I assumed that's how it works based on this response:
Using the Fast Vector Highlighter, you can specify tags in order of "importance" which seems to mean that their order and the order of your search terms should match.
Here's how my index is constructed:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "morfologik": {
          "tokenizer": "standard",
          "filter": [
            "morfologik_stem",
            "lowercase"
          ],
          "type": "custom"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "analyzer": "morfologik"
        },
        "original_doc": {
          "type": "integer"
        }
      }
    }
  }
}
Here's my query:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "maseczki" } },
        { "match": { "content": "dobre" } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 200,
        "type": "fvh",
        "pre_tags": ["<aspect>", "<feature>"],
        "post_tags": ["</aspect>", "</feature>"]
      }
    }
  }
}
And here's a sample response:
{
  "_index": "my_index",
  "_type": "doc",
  "_id": "R91v7GkB0hUBqPARgC54",
  "_score": 16.864662,
  "_source": {
    "content": "Najlepsza maseczka na zniszczone włosy! ",
    "original_doc_id": 74290
  },
  "highlight": {
    "content": [
      "<aspect>Najlepsza</aspect> <aspect>maseczka</aspect> na zniszczone włosy! "
    ]
  }
},
As I said, most of the time the query works fine, and sometimes the all-aspect highlighting occurs only for a subset of a specific query's results, as it does in the case of "(opakowanie, solidne)":
aspect here is in fact a feature, and feature is an aspect
<aspect>solidne</aspect>, naprawdę świetne <feature>opakowanie</feature>
solidne should be marked as a feature here
Jedyne do czego mogłabym się przyczepić to <aspect>opakowanie</aspect> które wg mnie niestety nie jest <aspect>solidne</aspect>
In my understanding, if you want to do a match query on a space-separated string, you should be using the whitespace tokenizer.
I would suggest you check this tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-whitespace-tokenizer.html
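For reference, a sketch of the question's analyzer with the tokenizer swapped to whitespace, keeping the existing morfologik_stem and lowercase filters (whether this changes the highlighter's tag assignment is something you would have to verify):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "morfologik": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "morfologik_stem",
            "lowercase"
          ]
        }
      }
    }
  }
}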

Elasticsearch keyword tokenizer doesn't work with phonetic analyzer

I want to add a custom phonetic analyzer, and I also don't want my given string to be tokenized. Suppose I have two strings:
KAMRUL ISLAM
KAMRAL ISLAM
I don't want to get any result with the query string KAMRUL, but I want both as results with the query string KAMRUL ISLAM.
For this, I have set up a custom phonetic analyzer with a keyword tokenizer.
Index Settings :
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": {
          "tokenizer": "keyword",
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "keyword",
          "filter": "dbl_metaphone"
        }
      }
    }
  }
}
Type Mappings:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "string",
      "analyzer": "dbl_metaphone"
    }
  }
}
I have inserted data with:
PUT /my_index/my_type/5
{
  "name": "KAMRUL ISLAM"
}
And my query string:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": {
        "query": "KAMRAL"
      }
    }
  }
}
Unfortunately I get both strings back as results. I am using ES 1.7.1. Is there any way to solve this?
Additionally, when I run:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRUL ISLAM'
I got the result:
{
  "tokens": [
    {
      "token": "KMRL",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 1
    }
  ]
}
And while running:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRAL'
I get:
{
  "tokens": [
    {
      "token": "KMRL",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    }
  ]
}
