Elasticsearch query scores all documents 1.0. Why?

I'm using Elasticsearch 2.4.1. When I execute the following query, every document is scored 1.0. Why?
I get the same behavior if I remove the "bool" and just do a match on one field.
Query:
{
"query" :
{
"bool": {
"must" : [
{"match" : { "last" : { "query" : "SMITH", "fuzziness" : 2.0 } } }
],
"should" : [
{"match" : { "first" : { "query" : "JOE", "fuzziness" : 1.0, "boost" : 99.0 } } }
]
}
}
}
Explain for one match gives me:
1.0 = sum of:
1.0 = ConstantScore(+(last:1mith^0.8 last:1smith^0.8 last:4mith^0.8 last:amith^0.8 last:asmith^0.8 last:bsmith^0.8 last:csmith^0.8 last:dsmith^0.8 last:emith^0.8 last:esmith^0.8 last:fsmith^0.8 last:hmith^0.8 last:hsmith^0.8 last:imith^0.8 last:ismith^0.8 last:jmith^0.8 last:jsmith^0.8 last:ksmith^0.8 last:lsmith^0.8 last:msith^0.8 last:msmith^0.8 last:nsmith^0.8 last:omith^0.8 last:osmith^0.8 last:psmith^0.8 last:qsmith^0.8 last:rsmith^0.8 last:saith^0.8 last:samith^0.8 last:scmith^0.8 last:seith^0.8 last:shith^0.8 last:simith^0.8 last:simth^0.8 last:skith^0.8 last:slith^0.8 last:smaith^0.8 last:smath^0.8 last:smdith^0.8 last:smeth^0.8 last:smfith^0.8 last:smich^0.8 last:smidh^0.8 last:smidth^0.8 last:smieth^0.8 last:smigh^0.8 last:smiht^0.8 last:smiih^0.8 last:smiith^0.8 last:smith) (first:aoe^0.6666666 first:bjoe^0.6666666 first:boe^0.6666666 first:coe^0.6666666 first:djoe^0.6666666 first:doe^0.6666666 first:eoe^0.6666666 first:foe^0.6666666 first:goe^0.6666666 first:hoe^0.6666666 first:ioe^0.6666666 first:j0e^0.6666666 first:jae^0.6666666 first:jbe^0.6666666 first:jce^0.6666666 first:jee^0.6666666 first:jeo^0.6666666 first:jge^0.6666666 first:jhe^0.6666666 first:jhoe^0.6666666 first:jie^0.6666666 first:jioe^0.6666666 first:jke^0.6666666 first:jle^0.6666666 first:jme^0.6666666 first:jne^0.6666666 first:jnoe^0.6666666 first:joa^0.6666666 first:joae^0.6666666 first:job^0.6666666 first:jobe^0.6666666 first:joc^0.6666666 first:joce^0.6666666 first:jod^0.6666666 first:jode^0.6666666 first:joe first:joea^0.6666666 first:joeb^0.6666666 first:joec^0.6666666 first:joed^0.6666666 first:joee^0.6666666 first:joef^0.6666666 first:joeg^0.6666666 first:joeh^0.6666666 first:joei^0.6666666 first:joej^0.6666666 first:joek^0.6666666 first:joel^0.6666666 first:joem^0.6666666 first:joen^0.6666666)^99.0), product of:
1.0 = boost
1.0 = queryNorm
0.0 = match on required clause, product of:
0.0 = # clause
0.0 = weight(_type:mytype in 327) [], result of:
0.0 = score(doc=327,freq=1.0), with freq of:
1.0 = termFreq=1.0
Type mapping:
{
"ourindex1": {
"mappings": {
"people": {
"properties": {
"city": {
"type": "string"
},
"first": {
"type": "string"
},
"last": {
"type": "string"
},
"middle": {
"type": "string"
},
"state": {
"type": "string"
},
"street": {
"type": "string"
},
"suffix": {
"type": "string"
},
"suite": {
"type": "string"
},
"territory": {
"type": "string"
},
"zip5": {
"type": "string"
}
}
}
}
}
}
Edit: Simplified Reproduction:
Download clean version of elasticsearch 2.4.1 and start it up
Create new index with:
POST /newindex/people
{"first" : "JOE", "last": "SMITH", "street" : "1 FIRST STREET", "city" : "LOS ANGELES", "state" : "CA", "middle" : ""}
Issue the following query:
{ "query" : {"match" : { "last" : { "query" : "SMITHX", "fuzziness" : 1.0 } } } }
When I do this, the returned document is scored 1.0, and explain mentions ConstantScore.
Edit 2: It appears my reproduction steps included an unintentional lie
The library my app uses to communicate with elasticsearch (elastic4s), appears to mangle the query so that it becomes:
{"query" : { "query" : {"match" : { "last" : { "query" : "SMITHX", "fuzziness" : 1.0 } } } } }
(Note the extra "query".) This mangled query returns the results I'd expect, but with score = 1.0. I thought I had already tried executing the query directly with curl, but evidently not.

This happens because of the doubled "query" keyword. In effect, the inner query selects the hits and scores them, producing something like this:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.30685285,
"hits": [
{
"_index": "my_index",
"_type": "people",
"_id": "2",
"_score": 0.30685285,
"_source": {
"first": "JOHN",
"last": "SMITHS",
"street": "2 SECOND STREET",
"city": "LA",
"state": "CA",
"middle": ""
}
},
{
"_index": "my_index",
"_type": "people",
"_id": "1",
"_score": 0.30685282,
"_source": {
"first": "JOE",
"last": "SMITH",
"street": "1 FIRST STREET",
"city": "LOS ANGELES",
"state": "CA",
"middle": ""
}
}
]
}
}
This is a fully correct response with proper scores. The outer "query" wrapper then leaves the result set unchanged but discards those scores and replaces them with a constant 1.0. So you need to fix your usage of elastic4s so that it sends only one "query" level.
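As a stopgap while the client code is being fixed, the request body can be checked before it is sent. The following is a minimal sketch in Python; `unwrap_query` is a hypothetical helper written for illustration and is not part of elastic4s or Elasticsearch.

```python
def unwrap_query(body):
    """Collapse {"query": {"query": {...}}} down to a single "query" level.

    Hypothetical helper for illustration; not an elastic4s or Elasticsearch API.
    """
    inner = body.get("query")
    # Keep peeling while the only key inside "query" is another "query".
    while isinstance(inner, dict) and set(inner) == {"query"}:
        inner = inner["query"]
    result = dict(body)
    if inner is not None:
        result["query"] = inner
    return result


mangled = {"query": {"query": {"match": {"last": {"query": "SMITHX", "fuzziness": 1}}}}}
fixed = unwrap_query(mangled)
print(fixed)
```

A body that is already well formed passes through unchanged, so the check is safe to apply to every request.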

Related

ElasticSearch sorting by more conditions

I have an index with simple data, and I need to filter and sort it as follows.
Records look like this:
{
"name": "Product ABC variant XYZ subvariant JKL",
"date": "2023-01-03T10:34:39+01:00"
}
I'm searching the name field for "Product FGH":
1) Get the records with an exact match (field name) and sort them by date (field date) DESC.
2) If nothing is found in 1), or if there is no exact match but there are similar records, sort the remaining records by the default score.
Is it possible to do this in one Elasticsearch request? What should the whole query look like?
Thanks
What you are looking for is running Elasticsearch queries based on conditions, which a single query cannot do: you need to fire the first query and, if it doesn't return any hits, fire the second one.
Using a script_score query, you can get the ordering you want in one request. For an exact match, convert the date to milliseconds and return it as the score. For a non-exact match, simply return _score.
Exact matches are then sorted by the date field, descending.
Non-exact matches are sorted by _score.
For example:
Mapping:
{
"mappings": {
"properties": {
"name" : {"type": "keyword"},
"date" : {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"}
}
}
}
Insert:
PUT func/_doc/1
{
"name" : "Product ABC variant XYZ subvariant JKL",
"date" : "2023-01-03 10:34:39"
}
PUT func/_doc/2
{
"name" : "Product ABC variant XYZ subvariant JKL",
"date" : "2022-12-03 10:33:39"
}
PUT func/_doc/3
{
"name" : "Product ABC",
"date" : "2022-11-03 10:33:39"
}
PUT func/_doc/4
{
"name" : "Product ABC",
"date" : "2023-01-03 10:33:39"
}
Query:
GET /func/_search
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "if (doc['name'].value == params.search_term) { return doc['date'].value.toInstant().toEpochMilli(); } else return _score",
"params": {
"search_term": "Product ABC"
}
}
}
}
}
output:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1672742040000,
"hits": [
{
"_index": "func",
"_id": "4",
"_score": 1672742040000,
"_source": {
"name": "Product ABC",
"date": "2023-01-03 10:33:39"
}
},
{
"_index": "func",
"_id": "3",
"_score": 1667471640000,
"_source": {
"name": "Product ABC",
"date": "2022-11-03 10:33:39"
}
},
{
"_index": "func",
"_id": "1",
"_score": 1,
"_source": {
"name": "Product ABC variant XYZ subvariant JKL",
"date": "2023-01-03 10:34:39"
}
},
{
"_index": "func",
"_id": "2",
"_score": 1,
"_source": {
"name": "Product ABC variant XYZ subvariant JKL",
"date": "2022-12-03 10:33:39"
}
}
]
}
}
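One caveat shows up in the output above: the scores of the exact matches are rounded rather than exact epoch values. The sketch below assumes that _score is held as a 32-bit float (as Lucene scores are) and shows in plain Python that two timestamps one minute apart can collapse to the same score. The helpers `to_float32` and `epoch_millis` are illustrative names, not Elasticsearch APIs.

```python
import struct
from datetime import datetime, timezone


def to_float32(value):
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


def epoch_millis(text):
    """Parse 'yyyy-MM-dd HH:mm:ss' as UTC and return epoch milliseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


doc4 = epoch_millis("2023-01-03 10:33:39")            # date of document 4
one_minute_later = epoch_millis("2023-01-03 10:34:39")

# The exact value and its single-precision approximation differ.
print(doc4, to_float32(doc4))
# Both timestamps land on the same float32, so such documents would tie.
print(to_float32(doc4) == to_float32(one_minute_later))
```

If ordering at that granularity matters, returning a smaller number from the script (for example, an offset from a recent reference date) reduces the rounding but does not remove it.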

Query on Elastic Search on multiple criterias

I have this document in Elasticsearch:
{
"_index" : "master",
"_type" : "_doc",
"_id" : "q9IGdXABeXa7ITflapkV",
"_score" : 0.0,
"_source" : {
"customer_acct" : "64876457056",
"ssn_number" : "123456789",
"name" : "Julie",
"city" : "NY"
}
}
I want to query the master index by customer_acct and ssn_number to retrieve the entire document, with scoring and relevance disabled. I have used the query below:
curl -X GET "localhost/master/_search/?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"term": {
"customer_acct": {
"value":"64876457056"
}
}
}
}'
I need to add the second criterion, ssn_number, to the term query as well. How would I do that? I am new to Elasticsearch. Is it possible to turn off scoring and relevance, and how would I fit the ssn_number criterion into the query above?
First, you need to define a proper mapping for your index. Your customer_acct and ssn_number values are numeric, but you are storing them as strings. Judging by your sample, they need the long type. Then you can use filter context in your query, since you don't need score and relevance in your result. Read more about filter context in the official ES documentation; here is the relevant snippet:
In a filter context, a query clause answers the question “Does this
document match this query clause?” The answer is a simple Yes or
No — no scores are calculated. Filter context is mostly used for
filtering structured data,
which is exactly your use-case.
1. Index Mapping
{
"mappings": {
"properties": {
"customer_acct": {
"type": "long"
},
"ssn_number" :{
"type": "long"
},
"name" : {
"type": "text"
},
"city" :{
"type": "text"
}
}
}
}
2. Index sample docs
{
"name": "Smithe John",
"city": "SF",
"customer_acct": 64876457065,
"ssn_number": 123456790
}
{
"name": "Julie",
"city": "NY",
"customer_acct": 64876457056,
"ssn_number": 123456789
}
3. Main search query to filter without the score
{
"query": {
"bool": {
"filter": [ --> only filter clause
{
"term": {
"customer_acct": 64876457056
}
},
{
"term": {
"ssn_number": 123456789
}
}
]
}
}
}
Above search query gives below result:
{
"took": 186,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "so-master",
"_type": "_doc",
"_id": "1",
"_score": 0.0, --> notice score is 0.
"_source": {
"name": "Smithe John",
"city": "SF",
"customer_acct": 64876457056,
"ssn_number": 123456789
}
}
]
}
}
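When the number of criteria varies, the same filter-only body can be generated rather than written by hand. This is a minimal sketch in Python; `build_filter_query` is a hypothetical helper, and it only builds the request body without sending it.

```python
import json


def build_filter_query(criteria):
    """Build a bool query whose term clauses all run in filter context."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}} for field, value in criteria.items()
                ]
            }
        }
    }


body = build_filter_query({"customer_acct": 64876457056, "ssn_number": 123456789})
print(json.dumps(body, indent=2))
```

Every clause sits under "filter", so each one only answers yes or no for a document and none contributes to the score.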

Elasticsearch advanced autocomplete

I want to autocomplete user input with Elasticsearch. There are tons of tutorials out there on how to do so, but none go into the really detailed stuff.
The last issue I'm having with my query is that it should score results that are not real "autocompletions" lower. Example:
IS:
I type: "Bed"
I find: "Bed", "Bigbed", "Fancy Bed", "Bed Frame"
WANT:
I type: "Bed"
I find: "Bed", "Bed Frame", [other "Bed XXX" results], "Fancy Bed", "Bigbed"
So I want Elasticsearch to complete "to the right" first, if that makes sense, and only then return results that have words in front of the input.
I've tried the completion suggester. It doesn't do other things I need, and it also has the same issue.
In German there are lots of words like Bigbed (which isn't a real word in English, I know), and I don't want those words ranked high. But since they match more closely than Bed Frame (because that is 2 tokens), they show up high.
This is my query currently:
POST autocompletion/_search?pretty
{
"query": {
"function_score": {
"query": {
"match": {
"keyword": {
"query": "Bed",
"fuzziness": 1,
"minimum_should_match": "100%"
}
}
},
"field_value_factor": {
"field": "bias",
"factor": 1
}
}
}
}
If you use the Elasticsearch completion suggester, as explained at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html, and query like this:
{
"suggest": {
"song-suggest" : {
"prefix" : "bed",
"completion" : {
"field" : "suggest"
}
}
}
}
You will get:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0.0,
"hits": []
},
"suggest": {
"song-suggest": [
{
"text": "bed",
"offset": 0,
"length": 3,
"options": [
{
"text": "Bed",
"_index": "autocomplete",
"_type": "_doc",
"_id": "1",
"_score": 34.0,
"_source": {
"suggest": {
"input": [
"Bed"
],
"weight": 34
}
}
},
{
"text": "Bed Frame",
"_index": "autocomplete",
"_type": "_doc",
"_id": "3",
"_score": 34.0,
"_source": {
"suggest": {
"input": [
"Bed Frame"
],
"weight": 34
}
}
}
]
}
]
}
}
If you want to use the search API instead, you can combine 2 queries:
a match query on the term "Bed"
a prefix query for text starting with "Bed"
Here is the mapping:
{
"mappings": {
"_doc" : {
"properties" : {
"suggest" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
}
Here is the search query:
{
"query" : {
"bool" : {
"must" : [
{
"match" : {
"suggest" : "Bed"
}
}
],
"should" : [
{
"prefix" : {
"suggest.keyword" : "Bed"
}
}
]
}
}
}
The should clause boosts documents whose text starts with "Bed". Et voilà!
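To see why this ordering comes out, here is a toy model of the bool query in Python. It is only a rough sketch under simplifying assumptions: real scoring is not a simple count, whitespace splitting stands in for the analyzer, and `rough_rank` is a hypothetical name.

```python
def rough_rank(candidates, typed):
    """Toy model: 'must' needs a whole-word match, 'should' adds a bonus
    when the full text starts with the typed input."""
    needle = typed.lower()
    scored = []
    for text in candidates:
        words = text.lower().split()
        if needle not in words:      # 'must' clause: no matching token, no hit
            continue
        score = 1
        if text.startswith(typed):   # 'should' clause: case-sensitive prefix on the keyword field
            score += 1
        scored.append((score, text))
    # Highest score first; ties keep input order because sorted() is stable.
    return [text for score, text in sorted(scored, key=lambda pair: -pair[0])]


ranked = rough_rank(["Fancy Bed", "Bigbed", "Bed Frame", "Bed"], "Bed")
print(ranked)
```

In this model "Bigbed" drops out entirely because, without fuzziness, its single token does not equal "bed", while "Fancy Bed" matches but gets no prefix bonus.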

Cross Field Search with Multiple Complete and Incomplete Phrases in Each Field

Example data:
PUT /test/test/1
{
"text1":"cats meow",
"text2":"12345",
"text3":"toy"
}
PUT /test/test/2
{
"text1":"dog bark",
"text2":"98765",
"text3":"toy"
}
And an example query:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats toy",
"type" : "cross_fields"
}
}
}
Returns the cat hit first and then the dog, which is what I want.
BUT when you query cat toy, both the cat and the dog get the same relevance score. I want to take the prefix of that word into account (and maybe a few other words inside that field) and still run cross_fields.
So if I search:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "phrase_prefix"
}
}
}
or
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats",
"type" : "phrase_prefix"
}
}
}
I should get the cat/ID 1, but I did not.
I found that using cross_fields achieves multi-word phrases, but not multi-incomplete phrases. And phrase_prefix achieves incomplete phrases, but not multiple incomplete phrases...
Sifting through the documentation really isn't helping me discover how to combine these two.
Yeah, I had to apply an analyzer...
The analyzer is applied to the fields when creating the index, before you add any data. I couldn't find an easier way to do this after the data has been added.
The solution I found is to explode every word into its individual prefixes so cross_fields can do its magic. You can learn more about the use of edge-ngram here.
So instead of cross_fields just searching for the term cats, it is now going to search c, ca, cat, and cats, and likewise for every following word. The text1 field will look like this to Elasticsearch: c ca cat cats m me meo meow.
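The expansion can be reproduced outside Elasticsearch to get a feel for it. This is a minimal sketch in Python, assuming whitespace splitting and lowercasing as a stand-in for the standard tokenizer plus lowercase filter; `edge_ngrams` is a hypothetical helper, and its defaults of 1 and 20 mirror the min_gram and max_gram values used in this answer.

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Return every leading prefix of each word, shortest first."""
    grams = []
    for word in text.lower().split():
        longest = min(len(word), max_gram)
        for length in range(min_gram, longest + 1):
            grams.append(word[:length])
    return grams


print(edge_ngrams("cats meow"))
# ['c', 'ca', 'cat', 'cats', 'm', 'me', 'meo', 'meow']
```

Because "cat" is now one of the indexed tokens for document 1, a search for cat toy can match text1 as well as text3.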
~~~
Here are the steps to make the above question example work:
First you create and name the analyzer. To learn a bit more what the filter's values mean, I recommend you take a look at this.
PUT /test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Then I attached this analyzer to each field.
I changed the text1 to match the field I was applying this to.
PUT /test/_mapping/test
{
"test": {
"properties": {
"text1": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
}
I ran GET /test/_mapping to be sure everything worked.
Then to add the data:
POST /test/test/_bulk
{ "index": { "_id": 1 }}
{ "text1": "cats meow", "text2": "12345", "text3": "toy" }
{ "index": { "_id": 2 }}
{ "text1": "dog bark", "text2": "98765", "text3": "toy" }
And the search!
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "cross_fields"
}
}
}
Which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.70778143,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.70778143,
"_source": {
"text1": "cats meow",
"text2": "12345",
"text3": "toy"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.1278426,
"_source": {
"text1": "dog bark",
"text2": "98765",
"text3": "toy"
}
}
]
}
}
This creates contrast between the two hits when you search cat toy, whereas before the scores were the same. Now the cat hit has the higher score, as it should. This is achieved by indexing every prefix (up to 20 characters in this case) of each word and then letting cross_fields judge how relevant the data is.

Elastic Search Won't Match For Arrays

I'm trying to search a document with the following structure:
{
"_index": "XXX",
"_type": "business",
"_id": "1252809",
"_score": 1,
"_source": {
"url": "http://Samuraijapanese.com",
"raw_name": "Samurai Restaurant",
"categories": [
{
"name": "Cafe"
},
{
"name": "Cajun Restaurant"
},
{
"name": "Candy Stores"
}
],
"location": {
"lat": "32.9948649",
"lon": "-117.2528171"
},
"address": "979 Lomas Santa Fe Dr",
"zip": "92075",
"phone": "8584810032",
"short_name": "samurai-restaurant",
"name": "Samurai Restaurant",
"apt": "",
"state": "CA",
"stdhours": "",
"city": "Solana Beach",
"hours": "",
"yelp": "",
"twitter": "",
"closed": 0
}
}
Searching by url, raw_name, address, etc. all works, but searching the categories returns nothing. I'm trying to search like this (if I swap anything else in for categories.name, it works):
{
"query": {
"filtered" : {
"filter" : {
"geo_distance" : {
"location" : {
"lon" : "-117.15726",
"lat" : "32.71533"
},
"distance" : "5mi"
}
},
"query" : {
"multi_match" : {
"query" : "Cafe",
"fields" : [
"categories.name"
]
}
}
}
},
"sort": [
{
"_score" : {
"order" : "desc"
}
},
{
"_geo_distance": {
"location": {
"lat": 32.71533,
"lon": -117.15726
},
"order": "asc",
"sort_mode": "min"
}
}
],
"script_fields": {
"distance_from_origin": {
"script": "doc['location'].arcDistanceInKm(32.71533,-117.15726)"
}
},
"fields": ["_source"],
"from": 0,
"size": 10
}
If I switch out, for example, categories.name with address, and change the search term to Lomas, it returns the result.
Without seeing your type mapping I can't answer definitively, but I would guess you have mapped categories as nested. When querying sub-documents of type nested (as opposed to object), you have to use a nested query.
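A minimal sketch of what that could look like, written as a Python dict builder and assuming categories really is mapped as nested (the mapping is not shown in the question). `nested_match` is a hypothetical helper; the resulting clause would take the place of the multi_match under "query" in the filtered query.

```python
import json


def nested_match(path, field, text):
    """Wrap a match on path.field in a nested query for the given path."""
    return {
        "nested": {
            "path": path,
            "query": {"match": {f"{path}.{field}": text}},
        }
    }


query_part = nested_match("categories", "name", "Cafe")
print(json.dumps(query_part, indent=2))
```

If categories turns out to be a plain object field rather than nested, this wrapper is not needed and the cause lies elsewhere.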

Resources