Elasticsearch search for Turkish characters

I have some documents that I am indexing with Elasticsearch. Some of them are written in upper case, with the Turkish characters replaced by their ASCII equivalents. For example, "kürşat" is written as "KURSAT".
I want to find this document by searching for "kürşat". How can I do that?
Thanks

Take a look at the asciifolding token filter.
Here is a small example for you to try out in Sense:
Index:
DELETE test
PUT test
{
"settings": {
"analysis": {
"filter": {
"my_ascii_folding": {
"type": "asciifolding",
"preserve_original": true
}
},
"analyzer": {
"turkish_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_ascii_folding"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer": "turkish_analyzer"
}
}
}
}
}
POST test/test/1
{
"name": "kürşat"
}
POST test/test/2
{
"name": "KURSAT"
}
Query:
GET test/_search
{
"query": {
"match": {
"name": "kursat"
}
}
}
Response:
"hits": {
"total": 2,
"max_score": 0.30685282,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.30685282,
"_source": {
"name": "KURSAT"
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.30685282,
"_source": {
"name": "kürşat"
}
}
]
}
Query:
GET test/_search
{
"query": {
"match": {
"name": "kürşat"
}
}
}
Response:
"hits": {
"total": 2,
"max_score": 0.4339554,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.4339554,
"_source": {
"name": "kürşat"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.09001608,
"_source": {
"name": "KURSAT"
}
}
]
}
The 'preserve_original' flag makes sure that if a user types 'kürşat', documents containing that exact form rank higher than documents that contain 'kursat' (notice the difference in scores between the two query responses).
If you want the scores to be equal, set the flag to false.
Hope I got your problem right!
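To make the effect of the filter chain more concrete, here is a minimal Python sketch of what lowercase plus asciifolding roughly does to a token. It is only an approximation: it strips combining marks via Unicode decomposition, whereas the real asciifolding filter uses its own mapping table and covers more characters. The helper names fold and analyze are made up for this illustration.

```python
import unicodedata


def fold(token: str) -> str:
    """Lowercase a token and strip combining marks (rough ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str, preserve_original: bool = True) -> list:
    """Mimic lowercase + asciifolding on whitespace-separated tokens."""
    tokens = []
    for raw in text.split():
        lowered = raw.lower()
        folded = fold(raw)
        tokens.append(folded)
        # With preserve_original, the unfolded form is kept next to the folded one.
        if preserve_original and lowered != folded:
            tokens.append(lowered)
    return tokens


print(analyze("kürşat"))                           # ['kursat', 'kürşat']
print(analyze("KURSAT"))                           # ['kursat']
print(analyze("kürşat", preserve_original=False))  # ['kursat']
```

Both documents end up with the token kursat, which is why either spelling finds both. Only the first document also carries kürşat, which is why it scores higher for that query when preserve_original is true.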

Related

How to make flattened sub-field in the nested field in elastic search?

Here, I have an indexed document like:
doc = {
"id": 1,
"content": [
{
"txt": "I",
"time": 0,
},
{
"txt": "have",
"time": 1,
},
{
"txt": "a book",
"time": 2,
},
{
"txt": "do not match this block",
"time": 3,
},
]
}
I want to match "I have a book" and return the matched times: 0, 1, 2. Does anyone know how to build the index and the query for this situation?
I think "content.txt" should be flattened but "content.time" should be nested?
You want to match "I have a book" and return the matched times: 0, 1, 2.
Here is a working example with the index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"content": {
"type": "nested"
}
}
}
}
Search Query:
{
"query": {
"nested": {
"path": "content",
"query": {
"bool": {
"must": [
{
"match": {
"content.txt": "I have a book"
}
}
]
}
},
"inner_hits": {}
}
}
}
Search Result:
"inner_hits": {
"content": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 2.5226097,
"hits": [
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 2
},
"_score": 2.5226097,
"_source": {
"txt": "a book",
"time": 2
}
},
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 0
},
"_score": 1.5580825,
"_source": {
"txt": "I",
"time": 0
}
},
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 1
},
"_score": 1.5580825,
"_source": {
"txt": "have",
"time": 1
}
}
]
}
}
}
}
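Once the response is back, the matched times still have to be read out of inner_hits. The sketch below is one way to do that in Python, assuming the response has already been parsed into a dict. The inner_hits literal is an abbreviated copy of the result above, and matched_times is a hypothetical helper name.

```python
# Abbreviated copy of the inner_hits section shown above.
inner_hits = {
    "content": {
        "hits": {
            "hits": [
                {"_nested": {"field": "content", "offset": 2},
                 "_source": {"txt": "a book", "time": 2}},
                {"_nested": {"field": "content", "offset": 0},
                 "_source": {"txt": "I", "time": 0}},
                {"_nested": {"field": "content", "offset": 1},
                 "_source": {"txt": "have", "time": 1}},
            ]
        }
    }
}


def matched_times(inner_hits: dict, path: str = "content") -> list:
    """Return the `time` of every matched nested object, in array order."""
    hits = inner_hits[path]["hits"]["hits"]
    # Inner hits come back sorted by score, so restore the original array order.
    ordered = sorted(hits, key=lambda hit: hit["_nested"]["offset"])
    return [hit["_source"]["time"] for hit in ordered]


print(matched_times(inner_hits))  # [0, 1, 2]
```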

Boolean similarity - is there a way to remove duplicates

Given the following index
PUT /test_index
{
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "whitespace",
"similarity": "boolean"
},
"field2": {
"type": "text",
"analyzer": "whitespace",
"similarity": "boolean"
}
}
}
}
and the following data
POST /test_index/_bulk?refresh=true
{ "index" : {} }
{ "field1": "foo", "field2": "bar"}
{ "index" : {} }
{ "field1": "foo1 foo2", "field2": "bar1 bar2"}
{ "index" : {} }
{ "field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"}
for the given Boolean similarity query
POST /test_index/_search
{
"size": 10,
"min_score": 0.4,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy":{
"field1":{
"value":"foo",
"fuzziness":"AUTO",
"boost": 1
}
}
},
{
"fuzzy":{
"field2":{
"value":"bar",
"fuzziness":"AUTO",
"boost": 1
}
}
}
]
}
}
}
}
}
I'm always receiving ["foo1 foo2 foo3", "bar1 bar2 bar3"] despite the fact that there is an exact result in index (the first one):
{
"took": 114,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 3.9999998,
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "bXw8eXUBCTtfNv84bNPr",
"_score": 3.9999998,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "bHw8eXUBCTtfNv84bNPr",
"_score": 2.6666665,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "a3w8eXUBCTtfNv84bNPr",
"_score": 2.0,
"_source": {
"field1": "foo",
"field2": "bar"
}
}
]
}
}
I'm aware that Boolean similarity works this way, matching as many terms as possible, and I know I could use rescoring here, but that is not an option since I don't know how many top N results to fetch.
Are there any other options? Maybe I could create my own similarity plugin based on Boolean similarity that removes duplicates and keeps only the best matched token, but I don't know where to start; I only see samples for script and rescore.
Update: I have updated this answer based on the clarification given in the comments on my earlier answer.
The query below returns the expected results. The term clause is used to boost exact terms, and its boost of 1.5 further boosts the exact match.
{
"min_score": 0.4,
"size":10,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy": {
"field1": {
"value": "foo",
"fuzziness": "AUTO",
"boost": 0.5
}
}
},
{
"term": {
"field1": {
"value": "foo",
"boost": 1.5
}
}
}
]
}
}
}
}
}
And search results
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "zdMEvHUBlo4-1mHbtvNH",
"_score": 2.0,
"_source": {
"field1": "foo",
"field2": "bar"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "z9MEvHUBlo4-1mHbtvNH",
"_score": 0.99999994,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "ztMEvHUBlo4-1mHbtvNH",
"_score": 0.6666666,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
}
]
Another query, without the explicit boost on the exact term (notice that the term clause has no boost), also returns the expected results:
{
"min_score": 0.4,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy": {
"field1": {
"value": "foo",
"fuzziness": "AUTO",
"boost": 0.5
}
}
},
{
"term": {
"field1": {
"value": "foo"
}
}
}
]
}
}
}
}
}
And search result
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "zdMEvHUBlo4-1mHbtvNH",
"_score": 1.5,
"_source": {
"field1": "foo",
"field2": "bar"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "z9MEvHUBlo4-1mHbtvNH",
"_score": 0.99999994,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "ztMEvHUBlo4-1mHbtvNH",
"_score": 0.6666666,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
}
]
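To see why the extra term clause changes the order, it can help to redo the arithmetic by hand. The Python sketch below uses an assumed scoring model, not the actual Lucene code: with boolean similarity, each term matched by the fuzzy query contributes boost * (1 - edits / min(term length, query length)), and an exact term match contributes its boost. This assumption happens to reproduce the scores reported above for field1. All function names are made up for the illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_score(tokens, query, boost, max_edits=1):
    """Assumed model: each matching term adds boost * (1 - edits / min length)."""
    total = 0.0
    for token in tokens:
        edits = edit_distance(token, query)
        if edits <= max_edits:  # AUTO allows one edit for a 3-character term
            total += boost * (1 - edits / min(len(token), len(query)))
    return total


def term_score(tokens, query, boost=1.0):
    """Assumed model: an exact term match scores exactly its boost."""
    return boost if query in tokens else 0.0


docs = {
    "foo": ["foo"],
    "foo1 foo2": ["foo1", "foo2"],
    "foo1 foo2 foo3": ["foo1", "foo2", "foo3"],
}

for name, tokens in docs.items():
    fuzzy_only = fuzzy_score(tokens, "foo", boost=1.0)
    with_term = (fuzzy_score(tokens, "foo", boost=0.5)
                 + term_score(tokens, "foo", boost=1.5))
    print(f"{name!r}: fuzzy only = {fuzzy_only:.2f}, fuzzy + term = {with_term:.2f}")
```

Under this model the fuzzy-only score grows with the number of near-matching tokens (1.00, 1.33, 2.00 for field1, doubled in the question because field2 is queried the same way), while adding the term clause lifts the exact document to 2.00 against 0.67 and 1.00 for the others.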

Elastic search query for name / value pair columns pull

We have one document in Elasticsearch with multiple name/value pair sections, and we want to fetch only the value, based on the value of the name column.
"envelopeData": {
"envelopeName": "Bills",
"details": {
"detail": [
{
"name": "UC_CORP",
"value": "76483"
},
{
"name": "UC_CYCLE",
"value": "V"
}
We expect only 76483 as the result, based on name being equal to UC_CORP.
If the field envelopeData.details.detail is of nested type, you can run a match query for the desired name on the nested path and use inner_hits to get just the value.
Map the field envelopeData.details.detail as nested (if it is not nested already):
PUT stackoverflow
{
"mappings": {
"_doc": {
"properties": {
"envelopeData.details.detail": {
"type": "nested"
}
}
}
}
}
then you can perform the following query to get value using inner_hits:
GET stackoverflow/_search
{
"_source": "false",
"query": {
"nested": {
"path": "envelopeData.details.detail",
"query": {
"match": {
"envelopeData.details.detail.name.keyword": "UC_CORP"
}
},
"inner_hits": {
"_source": "envelopeData.details.detail.value"
}
}
}
}
which outputs:
{
"_index": "stackoverflow",
"_type": "_doc",
"_id": "W5GUW2gB3GnGVyg-Sf4T",
"_score": 0.6931472,
"_source": {},
"inner_hits": {
"envelopeData.details.detail": {
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "stackoverflow",
"_type": "_doc",
"_id": "W5GUW2gB3GnGVyg-Sf4T",
"_nested": {
"field": "envelopeData.details.detail",
"offset": 0
},
"_score": 0.6931472,
"_source": {
"value": "76483" -> Outputs value only
}
}
]
}
}
}
}
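The reason the nested type matters here is that each name/value pair has to stay together as one object. The Python sketch below imitates that per-object lookup on the client side, using the sample document from the question. It does not call Elasticsearch, and values_for is a hypothetical helper name.

```python
document = {
    "envelopeData": {
        "envelopeName": "Bills",
        "details": {
            "detail": [
                {"name": "UC_CORP", "value": "76483"},
                {"name": "UC_CYCLE", "value": "V"},
            ]
        },
    }
}


def values_for(doc: dict, wanted_name: str) -> list:
    """Return the `value` of each detail entry whose `name` equals wanted_name."""
    entries = doc["envelopeData"]["details"]["detail"]
    # Each entry is checked as a unit, like a nested object in the index.
    return [entry["value"] for entry in entries if entry["name"] == wanted_name]


print(values_for(document, "UC_CORP"))  # ['76483']
```

The nested query with inner_hits does the equivalent work on the server, so only the matching pair is returned instead of the whole detail array.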

How to control scoring or ordering of results while using ngram in Elasticsearch?

I am using Elasticsearch 6.x.
I have created an index test_index with the type doc as follows:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "1",
"max_gram": "7",
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"my_text": {
"type": "text",
"fielddata": true,
"fields": {
"ngram": {
"type": "text",
"fielddata": true,
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
I have indexed data as follows:
PUT /test_index/doc/1
{
"my_text": "ohio"
}
PUT /test_index/doc/2
{
"my_text": "ohlin"
}
PUT /test_index/doc/3
{
"my_text": "john"
}
Then I used search query:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "oh",
"fields": [
"my_text^5",
"my_text.ngram"
]
}
}
]
}
}
}
And got the response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1.0042334,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1.0042334,
"_source": {
"my_text": "ohio"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 0.97201055,
"_source": {
"my_text": "john"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.80404717,
"_source": {
"my_text": "ohlin"
}
}
]
}
}
Here we can see that when I searched for oh, I got the results in this order:
-> ohio
-> john
-> ohlin
But I want the scoring and ordering of the results to give higher priority to a matching prefix:
-> ohio
-> ohlin
-> john
How can I achieve such a result? What approaches can I take here?
Thanks in advance.
You should add a new subfield with a new analyzer that uses the edge_ngram tokenizer, then add this subfield to your multi_match query.
You then need to use the most_fields type for the multi_match query. Only the documents that start with the search term will match on this subfield, so they will be boosted above the other matching documents.
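The difference between the two tokenizers is easy to see with a small Python sketch. It only imitates how the tokens are cut, with the same min_gram and max_gram as in the question. The functions ngrams and edge_ngrams are written for this illustration and are not part of any Elasticsearch client.

```python
def ngrams(text: str, min_gram: int, max_gram: int) -> set:
    """All substrings of length min_gram..max_gram (ngram-style tokens)."""
    return {
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    }


def edge_ngrams(text: str, min_gram: int, max_gram: int) -> set:
    """Only the prefixes of length min_gram..max_gram (edge_ngram-style tokens)."""
    return {text[:size] for size in range(min_gram, min(max_gram, len(text)) + 1)}


for word in ["ohio", "ohlin", "john"]:
    in_ngram = "oh" in ngrams(word, 1, 7)
    in_edge = "oh" in edge_ngrams(word, 1, 7)
    print(f"{word}: ngram match = {in_ngram}, edge_ngram match = {in_edge}")
```

All three words contain oh as an ngram token, but only ohio and ohlin have it as an edge n-gram. That is why the extra edge_ngram subfield adds score to the prefix matches and leaves john behind.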

Make a full word have more score than a Edge NGram subset

I'm trying to get a higher score for a document whose full name is matched than for documents where only an Edge NGram subset with the same value is matched.
So the results are:
Pos  Name              _score   _id
1    Baritone horn     7.56878  1786
2    Baritone ukulele  7.56878  2313
3    Bari              7.56878  2360
4    Baritone voice    7.56878  1787
I expected the third one ("Bari") to have a higher score since it is the full name. However, the edge n-gram decomposition means that all the others also have exactly the word "bari" indexed. As you can see in the results table, the score is equal for all of them, and I don't even know how Elasticsearch orders them, since the _ids are not sequential and the names are not sorted.
How can I achieve this?
Thanks
Example 'code'
Settings
{
"analysis": {
"filter": {
"edgeNGram_filter": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"edgeNGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"edgeNGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
Mapping:
{
"name": {
"type": "string",
"index": "not_analyzed"
},
"suggest": {
"type": "completion",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
"payloads": true
}
}
Query:
POST /attribute-tree/attribute/_search
{
"query": {
"match": {
"suggest": "Bari"
}
}
}
Results:
(only left relevant data)
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 7.56878,
"hits": [
{
"_index": "attribute-tree",
"_type": "attribute",
"_id": "1786",
"_score": 7.56878,
"_source": {
"name": "Baritone horn",
"suggest": {
"input": [
"Baritone",
"horn"
],
"output": "Baritone horn"
}
}
},
{
"_index": "attribute-tree",
"_type": "attribute",
"_id": "2313",
"_score": 7.56878,
"_source": {
"name": "Baritone ukulele",
"suggest": {
"input": [
"Baritone",
"ukulele"
],
"output": "Baritone ukulele"
}
}
},
{
"_index": "attribute-tree",
"_type": "attribute",
"_id": "2360",
"_score": 7.56878,
"_source": {
"name": "Bari",
"suggest": {
"input": [
"Bari"
],
"output": "Bari"
}
}
},
{
"_index": "attribute-tree",
"_type": "attribute",
"_id": "1787",
"_score": 7.56878,
"_source": {
"name": "Baritone voice",
"suggest": {
"input": [
"Baritone",
"voice"
],
"output": "Baritone voice"
}
}
}
]
}
}
You can use the bool query and its should clause to add score to exact matches, like this:
POST /attribute-tree/attribute/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"suggest": "Bari"
}
}
],
"should": [
{
"match": {
"name": "Bari"
}
}
]
}
}
}
The query in the should clause is called a signal clause in Elasticsearch: The Definitive Guide, and this is how you can distinguish perfect matches from n-gram ones. You will get all documents that match the must clause, but documents that also match the should queries will score higher because of the bool query scoring formula:
score = ("must" queries total score + matching "should" queries total score) / (total number of "must" queries and "should" queries)
The result is what you expect: Bari is the first result (far ahead in scoring :) ):
"hits": {
"total": 3,
"max_score": 0.4339554,
"hits": [
{
"_index": "attribute-tree",
"_type": "attribute",
"_id": "2360",
"_score": 0.4339554,
"_source": {
"name": "Bari",
"suggest": {
"input": [
"Bari"
],
"output": "Bari"
}
}
},
{
"_index": "attribute-tree",
"_type": "attribute",
"_id": "1786",
"_score": 0.04500804,
"_source": {
"name": "Baritone horn",
"suggest": {
"input": [
"Baritone",
"horn"
],
"output": "Baritone horn"
}
}
},
{
"_index": "attribute-tree",
"_type": "attribute",
"_id": "2313",
"_score": 0.04500804,
"_source": {
"name": "Baritone ukulele",
"suggest": {
"input": [
"Baritone",
"ukulele"
],
"output": "Baritone ukulele"
}
}
}
]
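The bool scoring formula quoted above can be written out as a few lines of Python to see why the exact match moves ahead. The per-clause scores below are hypothetical numbers chosen only to show the effect; the real scores depend on the index.

```python
def bool_score(must_scores, should_scores, total_clauses):
    """Sum the scores of the matching clauses, divided by the number of must and should clauses."""
    return (sum(must_scores) + sum(should_scores)) / total_clauses


# Hypothetical per-clause scores, chosen only to show the effect.
SUGGEST_MATCH = 1.0  # every returned document matches the must clause
NAME_MATCH = 1.0     # only "Bari" also matches the should clause

exact = bool_score([SUGGEST_MATCH], [NAME_MATCH], total_clauses=2)
partial = bool_score([SUGGEST_MATCH], [], total_clauses=2)
print(exact, partial)  # 1.0 0.5
```

A document that matches only the must clause still pays for the unmatched should clause in the denominator, so the exact match ends up clearly ahead.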
