How to match words with the same pronunciation in Elasticsearch

I would like to match words that are spelled differently but have the same pronunciation, like "mail" and "male", or "plane" and "plain". Can we do such matching in Elasticsearch?

You can use the phonetic analysis plugin for that task.
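If the plugin is not installed yet, it has to be added to every node (followed by a restart). The exact command depends on your Elasticsearch version; on recent releases it looks roughly like this:
# recent releases; older 1.x/2.x versions used "bin/plugin install ..." instead
bin/elasticsearch-plugin install analysis-phonetic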
Let's create an index with a custom analyzer leveraging that plugin:
curl -XPUT localhost:9200/phonetic -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "my_metaphone"
          ]
        }
      },
      "filter": {
        "my_metaphone": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": true
        }
      }
    }
  }
}'
Now let's analyze your example using that new analyzer. As you can see, both plain and plane will produce the single token PLN:
curl -XGET 'localhost:9200/phonetic/_analyze?analyzer=my_analyzer&pretty' -d 'plane'
curl -XGET 'localhost:9200/phonetic/_analyze?analyzer=my_analyzer&pretty' -d 'plain'
{
"tokens" : [ {
"token" : "PLN",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
Same thing for mail and male which produce the single token ML:
curl -XGET 'localhost:9200/phonetic/_analyze?analyzer=my_analyzer&pretty' -d 'mail'
curl -XGET 'localhost:9200/phonetic/_analyze?analyzer=my_analyzer&pretty' -d 'male'
{
"tokens" : [ {
"token" : "ML",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
I've used the metaphone encoder, but you're free to use any of the other supported encoders. You can find more information on all supported encoders:
- in the Apache Commons Codec documentation for metaphone, double_metaphone, soundex, caverphone, caverphone1, caverphone2, refined_soundex, cologne, beider_morse
- in the additional encoders for koelnerphonetik, haasephonetik and nysiis
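For completeness, here is a minimal sketch (not part of the original answer; the type and field names doc/title are made up, and the 1.x-style mapping API used above is assumed) showing how the analyzer could be attached to a field and queried:
curl -XPUT localhost:9200/phonetic/_mapping/doc -d '{
  "doc": {
    "properties": {
      "title": { "type": "string", "analyzer": "my_analyzer" }
    }
  }
}'

curl -XPUT localhost:9200/phonetic/doc/1 -d '{ "title": "air mail" }'

# searching for "male" should match the document containing "mail",
# because both terms are indexed as the phonetic token ML
curl -XGET 'localhost:9200/phonetic/_search?pretty' -d '{
  "query": { "match": { "title": "male" } }
}'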

You can use the phonetic token filter for this purpose. The phonetic token filter is provided by a plugin, so it requires separate installation and setup. You can make use of this blog post, which explains in detail how to set up and use the phonetic token filter.

A solution which doesn't need a plugin is to use a Synonym Token Filter. Example:
{
  "filter" : {
    "synonym" : {
      "type" : "synonym",
      "synonyms" : [
        "mail, male",
        "plane, plain"
      ]
    }
  }
}
You can also put the synonyms in a text file and reference that file instead; see the documentation linked above for an example.
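To make the filter effective, it has to be referenced from an analyzer in the index settings and that analyzer assigned to the field you search. A minimal sketch (the index, analyzer, and filter names here are made up, not from the original answer):
curl -XPUT localhost:9200/synonym_test -d '{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "mail, male",
            "plane, plain"
          ]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonyms" ]
        }
      }
    }
  }
}'
To load the synonyms from a file instead, replace the "synonyms" array with "synonyms_path": "analysis/synonyms.txt"; the path is resolved relative to the Elasticsearch config directory.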

Related

Elasticsearch's minimumShouldMatch for each member of an array

Consider an Elasticsearch entity:
{
"id": 123456,
"keywords": ["apples", "bananas"]
}
Now, imagine I would like to find this entity by searching for apple.
{
  "match" : {
    "keywords" : {
      "query" : "apple",
      "operator" : "AND",
      "minimum_should_match" : "75%"
    }
  }
}
The problem is that the 75% minimum for matching would be required for each of the strings in the array, so nothing will be found. Is there a way to say something like minimumShouldMatch: "75% of any array field"?
Note that I need to use AND as each item of keywords may be composed of longer text.
EDIT:
I tried the proposed solutions, but none of them gave the expected results. I guess the problem is that the text might be quite long, e.g.:
["national gallery in prague", "narodni galerie v praze"]
I guess the fuzzy expansion is just not able to expand such long strings if you only start searching with "national g".
Would this maybe be possible somehow via nested objects?
{ "keywords": [{ "keyword": "apples" }, { "keyword": "bananas" }] }
and then have minimumShouldMatch=1 on keywords and 75% on each keyword?
As per the documentation:
The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator parameter can be set to or or and to control the boolean clauses (defaults to or). The minimum number of optional should clauses to match can be set using the minimum_should_match parameter.
If you are searching for multiple tokens, for example "apples mangoes", and set minimum_should_match to 100%, it means both tokens must be present in the document. If you set it to 50%, it means at least one of them must be present.
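As a rough illustration (using the keywords field from the question), a query with two tokens and a minimum_should_match of 50% matches documents that contain at least one of the tokens:
GET index-name/_search
{
  "query": {
    "match": {
      "keywords": {
        "query": "apples mangoes",
        "minimum_should_match": "50%"
      }
    }
  }
}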
If you want to match tokens partially, you can use the fuzziness parameter. Fuzziness lets you set the maximum edit distance allowed for a match:
{
  "query": {
    "match": {
      "keywords": {
        "query": "apple",
        "fuzziness": "auto"
      }
    }
  }
}
If you are trying to match a word to its root form, you can use the "stemmer" token filter:
PUT index-name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Tokens generated
GET index-name/_analyze
{
"text": ["apples", "bananas"],
"analyzer": "my_analyzer"
}
"tokens" : [
{
"token" : "appl",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "banana",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 101
}
]
Stemming reduces words to their root form.
You can also explore n-grams and edge n-grams for partial matching.
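For example, here is a rough sketch of an edge n-gram setup (the index name and gram sizes are illustrative only, not from the original answer); the field is n-grammed at index time while queries are analyzed with the standard analyzer, so a prefix such as "appl" can match "apples":
PUT index-name-edge
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "edge_ngram_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}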

Indexing the last word of a string in Elasticsearch

I'm looking for a way to index the last word (or more generally, the last token) of a field into a separate sub-field. I've looked into the Predicate Script token filter, but the Painless script API in that context only provides the absolute position of the token from the start of the original input string, so I can only find the first token, like this:
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.position == 0
        """
      }
    }
  ],
  "text": "the fox jumps the lazy dog"
}
This works and results in:
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
But I need the last token, not the first. Is there any way to achieve this without preparing a separate field pre-indexing, outside of Elasticsearch?
You're on the right path! The solution is not far from what you have. You already know you can easily fetch the first token, but what you need is the last one... so just reverse the string.
The following analyzer will output just the token you need, i.e. dog.
We first reverse the whole string, then split it into tokens, use your predicate script to select only the first token, and finally reverse that token again. Voilà!
POST test/_analyze
{
  "text": "the fox jumps the lazy dog",
  "tokenizer": "keyword",
  "filter": [
    "reverse",
    "word_delimiter",
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.position == 0
        """
      }
    },
    "reverse"
  ]
}
Result:
{
"tokens" : [
{
"token" : "dog",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
}
]
}
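To actually index the last token into a separate sub-field, as the question asks, the same chain can be declared as a named analyzer and attached through a multi-field. A rough sketch, assuming a recent Elasticsearch version that supports predicate_token_filter (the index, field, and filter names here are made up):
PUT test-last-word
{
  "settings": {
    "analysis": {
      "filter": {
        "first_token_only": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.position == 0"
          }
        }
      },
      "analyzer": {
        "last_word_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "reverse", "word_delimiter", "first_token_only", "reverse" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "last_word": {
            "type": "text",
            "analyzer": "last_word_analyzer"
          }
        }
      }
    }
  }
}
With that mapping, indexing "the fox jumps the lazy dog" into message should leave only the token dog in message.last_word.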

Elasticsearch multilingual field

I have read through a few articles and pieces of advice, but unfortunately I haven't found a working solution.
The problem is that I have a field in the index that can contain content in any possible language, and I don't know which language it is. I need to search and sort on it. It is not localisation, just values in different languages.
The first language (excluding a few European ones) I tried it on was Japanese. To begin with, I set only one analyzer for this field and tried to search only for Japanese words/phrases. I took the example from here. Here is what I used for this:
'analysis': {
  "filter": {
    ...
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": [
        "\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c",
        "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"
      ]
    },
    ...
  },
  "analyzer": {
    ...
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    ...
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}
Mapper:
'name': {
'type': 'string',
'index': 'analyzed',
'analyzer': 'ja_analyzer',
}
And here are a few attempts to get results from it:
{
  'filter': {
    'query': {
      'bool': {
        'must': [
          {
            # 'wildcard': {'name': u'*ネバーランド福島*'}
            # 'match': {'name": u'ネバーランド福島'
            # },
            "query_string": {
              "fields": ['name'],
              "query": u'ネバーランド福島',
              "default_operator": 'AND'
            }
          },
        ],
        'boost': 1.0
      }
    }
  }
}
None of them works.
If I just take the standard analyser and query with query_string, or break the phrase myself (splitting on whitespace, which I don't have here) and use a wildcard *<>*, it again finds nothing. The analyser says that ネバーランド and 福島 are separate words/parts:
curl -XPOST 'http://localhost:9200/test/_analyze?analyzer=ja_analyzer&pretty' -d 'ネバーランド福島'
{
"tokens" : [ {
"token" : "ネハラント",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "福島",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
} ]
}
With the standard analyser I do get a result: if I look for ネバーランド I get what I want. But if I use the customised analyser and try the same, or even just one symbol, I still get nothing.
The behaviour I'm looking for is: break the query string into words/parts, and all of those words/parts must be present in the resulting name field.
Thank you in advance.

Elastic exact match w/o changing indexing

I have the following query to Elasticsearch:
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"entities.hashtags": "gf"
}
}
]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
},
entities.hashtags is an array, and as a result I receive entries with hashtags gf_anime, gf_whatever, gf_foobar, etc.
But what I need is to receive entries where the exact hashtag "gf" exists.
I've looked at other questions on SO and saw that the solution in this case is to change the analysis of entities.hashtags so it matches only exact values (I am pretty new to Elasticsearch, so I may get the terminology wrong here).
My question is whether it's possible to define an exact-match search INSIDE THE QUERY, i.e. without changing how Elasticsearch indexes its fields?
Are you sure that you need to do anything? Given your examples, you don't, and you probably don't want to use not_analyzed:
curl -XPUT localhost:9200/test -d '{
  "mappings": {
    "test" : {
      "properties": {
        "body" : { "type" : "string" },
        "entities" : {
          "type" : "object",
          "properties": {
            "hashtags" : {
              "type" : "string"
            }
          }
        }
      }
    }
  }
}'
curl -XPUT localhost:9200/test/test/1 -d '{
"body" : "anime", "entities" : { "hashtags" : "gf_anime" }
}'
curl -XPUT localhost:9200/test/test/2 -d '{
"body" : "anime", "entities" : { "hashtags" : ["GF", "gf_anime"] }
}'
curl -XPUT localhost:9200/test/test/3 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf_whatever", "gf_anime"] }
}'
With the above data indexed, your query only returns document 2 (note: this is a simplified version of your query without the unnecessary/undesirable and filter; at least for the time being, you should always use the bool filter rather than and/or, as it understands how to use the filter caches):
curl -XGET localhost:9200/test/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "entities.hashtags": "gf"
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}'
Where this breaks down is when you start putting in hashtag values that get split into multiple tokens, thereby triggering false hits with the term filter. You can determine how the field's analyzer will treat any value by passing it to the _analyze endpoint and telling it which field to take the analyzer from:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf_anime'
{
"tokens" : [ {
"token" : "gf_anime",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
# Note the space instead of the underscore:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf anime'
{
"tokens" : [ {
"token" : "gf",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "anime",
"start_offset" : 3,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
If you were to add a fourth document with the "gf anime" variant, then you would get a false hit:
curl -XPUT localhost:9200/test/test/4 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf whatever", "gf anime"] }
}'
This is really not an indexing problem, but a bad data problem.
With all of the explanation out of the way, you can inefficiently solve this by using a script that always follows the term filter (to efficiently rule out the more common cases that don't hit it):
curl -XGET localhost:9200/test/_search -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool" : {
          "must" : [
            {
              "term" : {
                "entities.hashtags" : "gf"
              }
            },
            {
              "script" : {
                "script" : "_source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null",
                "params" : {
                  "tag" : "gf"
                }
              }
            }
          ]
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}'
This works by parsing the original _source (and not the indexed values). That is why it is not going to be very efficient, but it will work until you reindex. The _source.entities.hashtags == tag portion is only necessary if hashtags is not always an array (in my example, document 1 would not be an array). If it is always an array, then you can use _source.entities.hashtags.contains(tag) instead of _source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null.
Note: The script language is Groovy, which is the default starting in 1.4.0. It is not the default in earlier versions, and it must be explicitly enabled using script.default_lang : groovy.
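When you do get the chance to reindex, the cleaner long-term fix (not shown in the original answer) is to add an unanalyzed variant of the field and run the term filter against it, for example with a 1.x-style multi-field (the index name test2 and sub-field name raw are made up):
curl -XPUT localhost:9200/test2 -d '{
  "mappings": {
    "test" : {
      "properties": {
        "body" : { "type" : "string" },
        "entities" : {
          "type" : "object",
          "properties": {
            "hashtags" : {
              "type" : "string",
              "fields" : {
                "raw" : { "type" : "string", "index" : "not_analyzed" }
              }
            }
          }
        }
      }
    }
  }
}'
A term filter on entities.hashtags.raw then only matches whole hashtag values such as gf, regardless of spaces or underscores inside other hashtags.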

ElasticSearch nGram filters out punctuation

In my Elasticsearch dataset we have unique IDs whose segments are separated by periods. A sample ID might look like c.123.5432
Using an nGram I'd like to be able to search for: c.123.54
This doesn't return any results. I believe the tokenizer is splitting on the period. To account for this I added "punctuation" to the token_chars, but there's no change in the results. My analyzer/tokenizer is below.
I've also tried "token_chars": [], which per the documentation should keep all characters.
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "10",
"token_chars": [ "letter", "digit", "whitespace", "punctuation", "symbol" ]
}
}
}
}
},
Edit (more info):
This is the mapping of the relevant field:
"ProjectID":{"type":"string","store":"yes", "copy_to" : "meta_data"},
And this is the field I'm copying it into (which also has the ngram analyzer):
"meta_data" : { "type" : "string", "store":"yes", "index_analyzer": "my_ngram_analyzer"}
This is the command I'm using in Sense to check whether my search works (note that it searches the meta_data field):
GET /_search?pretty=true
{
  "query": {
    "match": {
      "meta_data": "c.123.54"
    }
  }
}
Solution from s1monw at https://github.com/elasticsearch/elasticsearch/issues/5120
When only an index_analyzer is specified, search still uses the standard analyzer. To fix it, I changed index_analyzer to analyzer, so the same ngram analyzer is used at both index and search time. Keep in mind that the number of results will increase greatly, so raising min_gram may be necessary.
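In other words, the fix amounts to roughly this change to the field mapping shown above (same analyzer name as in the question):
"meta_data" : { "type" : "string", "store" : "yes", "analyzer" : "my_ngram_analyzer" }
After reindexing with this mapping, the match query on meta_data above finds c.123.54 (along with many more partial matches, as noted).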
