ElasticSearch match score - elasticsearch

I have a simple field of type "text" in my index.
"keywordName": {
"type": "text"
}
And I have these documents already inserted : "samsung", "samsung galaxy", "samsung cover", "samsung charger".
If I make a simple "match" query, the results are disturbing:
Query:
GET keywords/_search
{
"query": {
"match": {
"keywordName": "samsung"
}
}
}
Results:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1.113083,
"hits": [
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung galaxy",
"_score": 1.113083,
"_source": {
"keywordName": "samsung galaxy"
}
},
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung charger",
"_score": 0.9433406,
"_source": {
"keywordName": "samsung charger"
}
},
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung",
"_score": 0.8405092,
"_source": {
"keywordName": "samsung"
}
},
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung cover",
"_score": 0.58279467,
"_source": {
"keywordName": "samsung cover"
}
}
]
}
}
First Question : Why "samsung" has not the highest score?
Second Question : How can I make a query or analyser which gives me "samsung" as highest score?

Starting from the same index settings (analyzers, filters, mappings) as in my previous reply, I suggest the following solution. But, as I mentioned, you need to lay down all the requirements in terms of what you need to search for in this index and consider all of this as a complete solution.
DELETE test
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"custom_stop": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_stop",
"my_snow",
"asciifolding"
]
}
},
"filter": {
"my_stop": {
"type": "stop",
"stopwords": "_french_"
},
"my_snow": {
"type": "snowball",
"language": "French"
}
}
}
},
"mappings": {
"test": {
"properties": {
"keywordName": {
"type": "text",
"analyzer": "custom_stop",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
POST /test/test/_bulk
{"index":{}}
{"keywordName":"samsung galaxy"}
{"index":{}}
{"keywordName":"samsung charger"}
{"index":{}}
{"keywordName":"samsung cover"}
{"index":{}}
{"keywordName":"samsung"}
GET /test/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"keywordName": {
"query": "samsungs",
"operator": "and"
}
}
},
{
"term": {
"keywordName.raw": {
"value": "samsungs"
}
}
},
{
"fuzzy": {
"keywordName.raw": {
"value": "samsungs",
"fuzziness": 1
}
}
}
]
}
},
"size": 10
}

Related

Getting the document with searched term first, before its synonyms [Elastic]

I think I should explain my problem with an example:
Assume that I've created index with synonym analyzer and I declare that "laptop", "phone" and "tablet" are similar words that can be generalized as "mobile":
PUT synonym
{
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"phone, tablet, laptop => mobile"
]
}
}
}
}
},
"mappings": {
"synonym" : {
"properties" : {
"field1" : {
"type" : "text",
"analyzer": "synonym",
"search_analyzer": "synonym"
}
}
}
}
}
Now I am creating some docs:
PUT synonym/synonym/1
{
"field1" : "phone"
}
PUT synonym/synonym/2
{
"field1" : "tablet"
}
PUT synonym/synonym/3
{
"field1" : "laptop"
}
Now when I match query for laptop, tablet or phone, the result is always:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.2876821,
"hits": [
{
"_index": "synonym",
"_type": "synonym",
"_id": "2",
"_score": 0.2876821,
"_source": {
"field1": "tablet"
}
},
{
"_index": "synonym",
"_type": "synonym",
"_id": "1",
"_score": 0.18232156,
"_source": {
"field1": "phone"
}
},
{
"_index": "synonym",
"_type": "synonym",
"_id": "3",
"_score": 0.18232156,
"_source": {
"field1": "laptop"
}
}
]
}
}
You can see that the score of tablet is always higher even when I search for laptop.
I know that is because I declared them as similar words.
However, I am trying to figure out how can I query so that document with the search term can appear in the first place, before the similar words in the result list.
It can be done by boosting, but there must be a simpler approach..
Multi-fields to your rescue.
Index the field1 in two ways, one with the synonym analyzer, and the other with a standard analyzer.
Now you can simply use a bool-should query to add score for match on field1 (synonym) and on field1.raw (standard).
So, your mappings should be like so:
PUT synonym
{
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"phone, tablet, laptop => mobile"
]
}
}
}
}
},
"mappings": {
"synonym": {
"properties": {
"field1": {
"type": "text",
"analyzer": "synonym",
"search_analyzer": "synonym",
"fields": {
"raw": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
}
}
And you can query using:
GET synonyms/_search?search_type=dfs_query_then_fetch
{
"query": {
"bool": {
"should": [
{
"match": {
"field1": "tablet"
}
},
{
"match": {
"field1.raw": "tablet"
}
}
]
}
}
}
Notice: I've used search_type=dfs_query_then_fetch. Since you're testing on 3 shards and have very few documents, the scores you're getting aren't what they should be. This is because the frequencies are calculated per shard. You can use dfs_query_then_fetch while testing but it is discouraged for production. See: https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch

elastic bool query must match mot getting considered

i am basically trying to write a query where it should return the document where
school is "holy international" AND grade is "second".
but the issue with the current query is that its not considering the must match query part. ie even though i don't i specify the school is the giving me this document where as it is not a match.
query is giving me all the documents where the grade is second.
i want only document where school is "holy international" AND grade is "second".
as well as i have not specified in the match query for "schools.school" but its giving me results.
mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase1": {
"tokenizer": "keyword",
"filter": ["lowercase", "my_pattern_replace1", "trim"]
},
"my_keyword_lowercase2": {
"tokenizer": "standard",
"filter": ["lowercase", "trim"]
}
},
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
}
},
"mappings": {
"test_data": {
"properties": {
"schools": {
"type": "nested",
"properties": {
"school": {
"type": "string",
"analyzer": "my_keyword_lowercase1"
},
"grade": {
"type": "string",
"analyzer": "my_keyword_lowercase2"
}
}
}
}
}
}
}
data
{
"_index": "data_index",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_version": 1,
"found": true,
"_source": {
"summary": null,
"schools": [{
"school": "little flower",
"grade": "first",
"date": "2007-06-01",
},
{
"school": "holy international",
"grade": "second",
"date": "2007-06-01",
},
],
"first_name": "Adam",
"location": "Kansas City",
"last_name": "Roger",
"country": "US",
"name": "Adam Roger",
}
}
query
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "" <-----X didnt specify anything
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second",
"operator": "and",
"minimum_should_match": "100%"
}
}
}
}
}
}
}
}
result
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "data_test",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_score": 0.2876821,
"_source": {
"first_name": "Adam"
},
"inner_hits": {
"schools": {
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_nested": {
"field": "schools",
"offset": 0
},
"_score": 0.2876821,
"_source": {
"schools": {
"school": "holy international",
"grade": "second"
}
}
}
]
}
}
}
}
]
}
}
So, basically your problem is analysis step, when I load everything and checked, it become very clear:
This filter completely wipes all string from schools.school field
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
I think, that's happening because . is regexp literal, so, when I checked it:
POST /_analyze
{
"field": "schools.school",
"text": "holy international"
}
{
"tokens": [
{
"token": "",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}
That's why you always get a match, every string you passed during indexing time and during search time becomes "". Some additional info from Elastic wiki - https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-pattern_replace-tokenfilter.html
After I removed patter replace filter, this query returns everything as expected:
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "holy international"
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second"
}
}
}
}
}
}
}
}

Querying elasticsearch with OR and wildcards

I'm trying to do a simple query to my elasticsearch _type and match multiple fields with wildcards, my first attempt was like this:
POST my_index/my_type/_search
{
"sort" : { "date_field" : {"order" : "desc"}},
"query" : {
"filtered" : {
"filter" : {
"or" : [
{
"term" : { "field1" : "4848" }
},
{
"term" : { "field2" : "6867" }
}
]
}
}
}
}
This example will successfully match every record when field1 OR field2 are exactly equal to 4848 and 6867 respectively.
What I'm trying to do is to match on field1 any text that contains 4848 and field2 that contains 6867 but I'm not really sure how to do it.
I appreciate any help I can get :)
It sounds like your problem has mostly to do with analysis. The appropriate solution depends on the structure of your data and what you want to match. I'll provide a couple of examples.
First, let's assume that your data is such that we can get what we want just using the standard analyzer. This analyzer will tokenize text fields on whitespace, punctuation and symbols. So the text "1234-5678-90" will be broken into the terms "1234", "5678", and "90", so a "term" query or filter for any of those terms will match that document. More concretely:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"analyzer": "standard"
},
"field2":{
"type": "string",
"analyzer": "standard"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "1212-2323-4848","field2": "1234-5678-90"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "0000-0000-0000","field2": "0987-6543-21"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "1111-2222-3333","field2": "6867-4545-90"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "1212-2323-4848",
"field2": "1234-5678-90"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "1111-2222-3333",
"field2": "6867-4545-90"
}
}
]
}
}
(Explicitly writing "analyzer": "standard" is redundant since that is the default analyzer used if you do not specify one; I just wanted to make it obvious.)
On the other hand, if the text is embedded in such a way that the standard analysis doesn't provide what you want, say something like "121223234848" and you want to match on "4848", you will have to do something little more sophisticated, using ngrams. Here is an example of that (notice the difference in the data):
DELETE /test_index
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"field2":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "121223234848","field2": "1234567890"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "000000000000","field2": "0987654321"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "111122223333","field2": "6867454590"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "121223234848",
"field2": "1234567890"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "111122223333",
"field2": "6867454590"
}
}
]
}
}
There is a lot going on here, so I won't attempt to explain it in this post. If you want more explanation I would encourage you to read this blog post: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. Hope you'll forgive the shameless plug. ;)
Hope that helps.

How to return nested documents and some of its fileds via a query over main document?

I have the following index on elasticsearch:
PUT /blog
{
"mappings": {
"threadQ":{
"properties": {
"title" : {
"type" : "string",
"analyzer" : "standard"
},
"body" : {
"type" : "string",
"analyzer" : "standard"
},
"posts":{
"type": "nested",
"properties": {
"comment": {
"type": "string",
"analyzer": "standard"
},
"prototype": {
"type": "string",
"analyzer": "standard"
},
"customScore":{
"type": "long"
}
}
}
}
}
}
}
And I added one document:
PUT /blog/threadQ/1
{
"title": "What is c#?",
"body": "C# is a good programming language, makes it easy to develop!",
"posts": [{
"comment": "YEP!",
"prototype": "Hossein Bakhtiari",
"customScore": 2
},
{
"comment": "NEVER EVER :O",
"prototype": "Garpizio En Larri",
"customScore": 3
}]
}
So the following query works:
POST /blog/threadQ/_search
{
"query": {
"bool": {
"must": [{
"nested": {
"query": {
"query_string": {
"fields": ["posts.comment"],
"query": "YEP"
}
},
"path": "posts"
}
}]
}
}
}
And the result is the document.
Now want to make a query like this:
SELECT threadQ.posts.customScore FROM threadQ WHERE threadQ.posts.comment = "YEP!"
Please tell me how I can implement it.
To return a specific field in the document either use the fields or _source parameters
Here _source is used
curl -XGET http://localhost:9200/blog/threadQ/_search -d '
{
"_source" : "posts.customScore",
"query": {
"bool": {
"must": [{
"nested": {
"query": {
"query_string": {
"fields": ["posts.comment"],
"query": "YEP"
}
},
"path": "posts"
}
}]
}
}
}'
it will return:
"hits" : {
"total" : 1,
"max_score" : 2.252763,
"hits" : [ {
"_index" : "myindex",
"_type" : "threadQ",
"_id" : "1",
"_score" : 2.252763,
"_source":{"posts":[{"customScore":2},{"customScore":3}]}
} ]
}
}
Finally the problem has been solved by dynamic templates. So the new index structure is like this:
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"Id":{
"type": "integer",
"analyzer": "standard"
},
"name":{
"type": "string",
"analyzer": "english"
}
},
"dynamic_templates": [
{ "en": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"analyzer": "english"
}
}}
]
}}}
And the query:
POST /my_index/my_type/_search
{
"query": {
"function_score": {
"query": {"match_all": {}},
"functions": [
{
"script_score": {
"script": "doc.apple.value * _score"
}
}
]
}
}
}
And the result looks like this:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 14,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 14,
"_source": {
"Id": 2,
"name": "Second One",
"iphone": 20,
"apple": 14
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 14,
"_source": {
"Id": 3,
"name": "Third One",
"apple": 14
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"Id": 1,
"name": "First One",
"iphone": 2,
"apple": 1
}
}
]
}
}

Unexpected (case-insensitive) string sorting in Elasticsearch

I have a list of console platforms that I'm sorting in Elasticsearch.
Here is the mapping for the "name" field:
{
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"index": "analyzed"
},
"sort_name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
When I execute the following query
{
"query": {
"match_all": {}
},
"sort": [
{
"name.sort_name": { "order": "asc" }
}
],
"fields": ["name"]
}
I get these results:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"failed": 0
},
"hits": {
"total": 17,
"max_score": null,
"hits": [
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602489",
"_score": null,
"fields": {
"name": "GameCube"
},
"sort": [
"GameCube"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602490",
"_score": null,
"fields": {
"name": "Gameboy Advance"
},
"sort": [
"Gameboy Advance"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602498",
"_score": null,
"fields": {
"name": "Nintendo 3DS"
},
"sort": [
"Nintendo 3DS"
]
},
...remove for brevity ...
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602493",
"_score": null,
"fields": {
"name": "Xbox 360"
},
"sort": [
"Xbox 360"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602502",
"_score": null,
"fields": {
"name": "Xbox One"
},
"sort": [
"Xbox One"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602497",
"_score": null,
"fields": {
"name": "iPhone/iPod"
},
"sort": [
"iPhone/iPod"
]
}
]
}
Everything is sorted as expected except the iPhone/iPod result is at the end (instead of after GameBoy Advance) - why does the / in the name have an effect on the sorting?
Thanks
Okay so I discovered the reason wasn't anything to do with the /. ES will sort by capital letters then lower case letters.
I added a custom analyzer to the settings of the index creation:
{
"analysis": {
"analyzer": {
"sortable": {
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
}
Then in the field mapping I added 'analyzer': 'sortable' to the sort_name multi field.
Use Normalizer with keyword to handle the sort
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html#analysis-normalizers
PUT index_name
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": ["quote"],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
Search query may be modified like this
{
"query": {
"match_all": {}
},
"sort": [
{
"name.sort_name": { "order": "asc" }
}
],
"fields": "name.keyword"
}
According to https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html (ElasticSearch 7.16) ...
Elasticsearch ships with a lowercase built-in normalizer.
So you can define an additional field (in the example below named "lowersortable"):
PUT /myindex/_mapping
{
"properties": {
"myproperty": {
"type": "text",
"fields": {
"lowersortable": {
"type": "keyword",
"normalizer": "lowercase"
}
}
}
}
}
... and use this field myproperty.lowersortable for sorting in the search query.

Resources