Custom scoring function in Elasticsearch does not return expected field value

I created a custom scoring function for my documents that just returns the value of the field a for each document. But for some reason, in the example below, the last digits of the _score in the results differ from the last digits of the value of a. What is happening here?
PUT test/doc/1
{
"a": 851459198
}
PUT test/doc/2
{
"a": 984968088
}
GET test/_search
{
"query": {
"function_score": {
"script_score": {
"script": {
"inline": "doc[\"a\"].value"
}
}
}
}
}
That will return the following:
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 984968060,
"hits": [
{
"_index": "test",
"_type": "doc",
"_id": "2",
"_score": 984968060,
"_source": {
"a": 984968088
}
},
{
"_index": "test",
"_type": "doc",
"_id": "1",
"_score": 851459200,
"_source": {
"a": 851459198
}
}
]
}
}
Why is the _score different than the value of the field a?
I'm using Elasticsearch 2.1.1

The _score value is internally hard-coded as a float, which can only represent every integer exactly up to 16777216 (2^24). If the scoring function returns a larger number, such as the value of the field a here, the result is rounded to the nearest value a float can represent, so the last digits change. See this github issue
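To see the rounding outside Elasticsearch, here is a minimal Python sketch that pushes the two field values through a 32-bit float. The helper name to_float32 is made up for this illustration; it only assumes that _score behaves like a standard IEEE 754 single-precision float.

```python
import struct


def to_float32(value):
    """Round-trip a number through a 32-bit IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Every integer up to 2**24 fits exactly; the next one does not.
print(2 ** 24)                      # 16777216
print(to_float32(2 ** 24 + 1))      # 16777216.0

# The two field values from the question, stored as 32-bit floats.
for a in (851459198, 984968088):
    print(a, "->", int(to_float32(a)))
# 851459198 -> 851459200
# 984968088 -> 984968064
```

The response shows 984968060 rather than 984968064, presumably because the float is printed with only as many digits as are needed to tell it apart from its neighbours.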

Related

Searching within a percentile in elastic [duplicate]

Say I want to filter documents by some field within the 10th to 20th percentile. I'm wondering if it's possible with some simple query, something like {"fieldName":{"percentile": [0.1, 0.2]}}.
Say I have these documents:
[{"a":1,"b":101},{"a":2,"b":102},{"a":3,"b":103}, ..., {"a":100,"b":200}]
I need to filter the first 10% of them by a (in ascending order), that would be a from 1 to 10. Then I need to sort those results by b in descending order, then take the paginated result (like page No. 2, with 10 items per page).
One solution in mind would be:
1. Get the total count of the documents.
2. Sort the documents by a and take the corresponding _id values, with the limit set to 0.1 * total_count.
3. Write the final query, something like id in (...) order by b.
But the shortcomings are pretty obvious too:
- It seems inefficient if we're talking about subsecond latency.
- The second query might not work if the first query returns too many _id values (ES only allows 1000 by default. I can change the config of course, but there's always a limit).
I doubt that there is a way to do this in one query if the exact values of a are not known beforehand, although I think one pretty efficient approach is feasible.
I would suggest running a percentiles aggregation as the first query and a range query as the second.
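As a rough sketch of that two-query flow, the Python helpers below only build the two request bodies as plain dicts; they do not talk to a cluster. The function names are hypothetical, and the from/size paging is my addition to cover the "page No. 2" part of the question.

```python
import json


def percentiles_request(field, percents):
    """First query: ask only for the percentile bounds, no hits."""
    return {
        "size": 0,
        "aggs": {
            "percentiles": {
                "percentiles": {"field": field, "percents": list(percents)}
            }
        },
    }


def range_page_request(field, low, high, sort_field, page, page_size):
    """Second query: range filter on the bounds, sorted and paginated."""
    return {
        "from": (page - 1) * page_size,  # page numbers start at 1
        "size": page_size,
        "query": {"range": {field: {"gte": low, "lte": high}}},
        "sort": {sort_field: "desc"},
    }


first = percentiles_request("a", [10, 20])
# In a real run, low and high would be read from the first response.
second = range_page_request("a", 4.9, 8.8, "b", page=2, page_size=10)
print(json.dumps(first))
print(json.dumps(second))
```

Keeping the bounds as arguments means the second body can be rebuilt for every page without repeating the aggregation.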
In my sample index I have only 14 documents, so for explanatory reasons I will look for the documents that fall between the 30th and 60th percentiles of field a and sort them by field b in descending order (to make sure the sort worked).
Here are the docs I inserted:
{"a":1,"b":101}
{"a":5,"b":105}
{"a":10,"b":110}
{"a":2,"b":102}
{"a":6,"b":106}
{"a":7,"b":107}
{"a":9,"b":109}
{"a":4,"b":104}
{"a":8,"b":108}
{"a":12,"b":256}
{"a":13,"b":230}
{"a":14,"b":215}
{"a":3,"b":103}
{"a":11,"b":205}
Let's find out which are the bounds for field a between 30% and 60% percentiles:
POST my_percent/doc/_search
{
"size": 0,
"aggs" : {
"percentiles" : {
"percentiles" : {
"field" : "a",
"percents": [ 30, 60, 90 ]
}
}
}
}
With my sample index it looks like this:
{
...
"hits": {
"total": 14,
"max_score": 0,
"hits": []
},
"aggregations": {
"percentiles": {
"values": {
"30.0": 4.9,
"60.0": 8.8,
"90.0": 12.700000000000001
}
}
}
}
Now we can use the boundaries to do the range query:
POST my_percent/doc/_search
{
"query": {
"range": {
"a" : {
"gte" : 4.9,
"lte" : 8.8
}
}
},
"sort": {
"b": "desc"
}
}
And the result is:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": null,
"hits": [
{
"_index": "my_percent",
"_type": "doc",
"_id": "vkFvYGMB_zM1P5OLcYkS",
"_score": null,
"_source": {
"a": 8,
"b": 108
},
"sort": [
108
]
},
{
"_index": "my_percent",
"_type": "doc",
"_id": "vUFvYGMB_zM1P5OLWYkM",
"_score": null,
"_source": {
"a": 7,
"b": 107
},
"sort": [
107
]
},
{
"_index": "my_percent",
"_type": "doc",
"_id": "vEFvYGMB_zM1P5OLRok1",
"_score": null,
"_source": {
"a": 6,
"b": 106
},
"sort": [
106
]
},
{
"_index": "my_percent",
"_type": "doc",
"_id": "u0FvYGMB_zM1P5OLJImy",
"_score": null,
"_source": {
"a": 5,
"b": 105
},
"sort": [
105
]
}
]
}
}
Note that the results of percentiles aggregation are approximate.
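For a quick sanity check of those numbers without a cluster, the same two steps can be replayed locally in Python on the 14 sample documents. The percentile helper uses plain linear interpolation between ranks, which is an assumption on my part: Elasticsearch uses an approximate algorithm, so on larger data the bounds may differ slightly.

```python
docs = [
    {"a": 1, "b": 101}, {"a": 5, "b": 105}, {"a": 10, "b": 110},
    {"a": 2, "b": 102}, {"a": 6, "b": 106}, {"a": 7, "b": 107},
    {"a": 9, "b": 109}, {"a": 4, "b": 104}, {"a": 8, "b": 108},
    {"a": 12, "b": 256}, {"a": 13, "b": 230}, {"a": 14, "b": 215},
    {"a": 3, "b": 103}, {"a": 11, "b": 205},
]


def percentile(sorted_values, percent):
    """Percentile by linear interpolation between the two closest ranks."""
    rank = percent / 100 * (len(sorted_values) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


# Step 1: find the bounds (roughly 4.9 and 8.8 for this data).
values = sorted(doc["a"] for doc in docs)
low, high = percentile(values, 30), percentile(values, 60)

# Step 2: keep documents inside the bounds and sort by b, descending.
selected = [doc for doc in docs if low <= doc["a"] <= high]
selected.sort(key=lambda doc: doc["b"], reverse=True)
print([doc["a"] for doc in selected])  # [8, 7, 6, 5]
```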
In general, this looks like a task better solved by pandas or a Spark job.
Hope that helps!

Function score ignored

I have two nearly identical documents, one of which has the fields CONSTRUCTION: 1 and EDUCATION: 0.1, the other with CONSTRUCTION: 0.1 and EDUCATION: 1. I want to be able to sort results by the value of either the CONSTRUCTION or EDUCATION field
GET /objects/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "Monkeys"
}
}
},
"field_value_factor": {
"field" : "CONSTRUCTION",
"missing": 1
}
}
},
"_source": ["name", "CONSTRUCTION", "EDUCATION"]
}
Returns the incorrect results:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.7622693,
"hits": [
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:12",
"_score": 1.7622693,
"_source": {
"CONSTRUCTION": 0.1,
"name": "Space Monkeys - education",
"EDUCATION": 1
}
},
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:11",
"_score": 1.0226655,
"_source": {
"CONSTRUCTION": 1,
"name": "Space Monkeys - construction",
"EDUCATION": 0.1
}
}
]
}
}
This always returns the same results. Indeed, if you misspell the field name in field_value_factor, for example "field_value_factor": { "field" : "WHATEVER",... }, you get the same score. This suggests the field simply isn't being read.
Dynamic mapping was turned off, so the EDUCATION and CONSTRUCTION fields were never mapped and every document fell back to the missing value of 1. Mystery solved!
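Here is a small Python model of why an unmapped field leaves the ranking untouched. It is a simplification, not Elasticsearch code: it assumes the default multiply behaviour of field_value_factor with no modifier, and it treats the scores in the response above as the plain match scores. The names field_value_factor_score and rank are invented for the sketch.

```python
def field_value_factor_score(query_score, indexed_fields, field, missing=1.0):
    """Simplified model: the query score multiplied by the field value.

    indexed_fields holds only fields the index actually mapped; values
    that exist solely in _source are invisible to scoring.
    """
    return query_score * indexed_fields.get(field, missing)


# Query scores and field values copied from the response in the question.
docs = {
    "KeyPages:12": {"score": 1.7622693, "fields": {"CONSTRUCTION": 0.1, "EDUCATION": 1}},
    "KeyPages:11": {"score": 1.0226655, "fields": {"CONSTRUCTION": 1, "EDUCATION": 0.1}},
}


def rank(field, mapped):
    """Score every document and return (id, score) pairs, best first."""
    scored = {
        doc_id: field_value_factor_score(
            doc["score"], doc["fields"] if mapped else {}, field
        )
        for doc_id, doc in docs.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


print(rank("CONSTRUCTION", mapped=False))  # scores unchanged, KeyPages:12 first
print(rank("CONSTRUCTION", mapped=True))   # KeyPages:11 moves to the top
```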

Finding multiple Elasticsearch documents with same ids, different types

I need to find out if any document with a certain id was already indexed in my ES database, so that I can delete them before indexing a new document.
The trouble is I do not know a priori the type it was indexed as.
I found the _mget query, which sounds like it could be what I need, but this quote in the documentation says I only get one (random) hit when searching:
If you don’t set the type and have many documents sharing the same
_id, you will end up getting only the first matching document.
How can I get this behaviour: finding all documents sharing an _id (possibly more than one, with different _type values in the same index) without an expensive _search query?
thanks!
A simple term query on "_id" worked for me.
I created a trivial index and added two documents for each of two different types:
PUT /test_index
POST /test_index/_bulk
{"index":{"_type":"type1","_id":1}}
{"name":"type1 doc1"}
{"index":{"_type":"type1","_id":2}}
{"name":"type1 doc2"}
{"index":{"_type":"type2","_id":1}}
{"name":"type2 doc1"}
{"index":{"_type":"type2","_id":2}}
{"name":"type2 doc2"}
And this query will return both documents with id 1:
POST /test_index/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"_id": "1"
}
}
}
}
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "type1",
"_id": "1",
"_score": 1,
"_source": {
"name": "type1 doc1"
}
},
{
"_index": "test_index",
"_type": "type2",
"_id": "1",
"_score": 1,
"_source": {
"name": "type2 doc1"
}
}
]
}
}
Here's the code I used:
http://sense.qbox.io/gist/a8085b57c22631148dd4c67769307caf6425fd95
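Since the goal in the question is to delete those documents before re-indexing, one possible follow-up is to turn the hits into delete actions for the _bulk API. The Python sketch below only builds the request bodies; the helper names are hypothetical and the response is a trimmed copy of the one above.

```python
import json


def id_query(doc_id):
    """Search body matching every document with this _id, whatever its type."""
    return {"query": {"constant_score": {"filter": {"term": {"_id": str(doc_id)}}}}}


def bulk_delete_lines(response):
    """Turn search hits into _bulk delete action lines, one per hit."""
    return [
        json.dumps(
            {"delete": {"_index": hit["_index"], "_type": hit["_type"], "_id": hit["_id"]}}
        )
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the search response shown above.
response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "test_index", "_type": "type1", "_id": "1"},
            {"_index": "test_index", "_type": "type2", "_id": "1"},
        ],
    }
}

print(json.dumps(id_query(1)))
for line in bulk_delete_lines(response):
    print(line)
```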

How to filter out elements from an array that doesn’t match the query?

Stack Overflow won't let me post that much example code, so I put it on gist.
So I have this index
with this mapping
here is a sample document I insert into newly created mapping
this is my query
GET products/paramSuggestions/_search
{
"size": 10,
"query": {
"filtered": {
"query": {
"match": {
"paramName": {
"query": "col",
"operator": "and"
}
}
}
}
}
}
This is the unwanted result I get from the previous query:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
{
"paramName": "capacity",
"value": "32GB"
}
]
}
}
]
}
}
And finally, this is how I want the query result to look:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
}
]
}
}
]
}
}
What should the query look like so that the array field is filtered down to the items that match the query? In other words, all non-matching array items should not appear in the final result.
The final result is the _source document that you indexed. There is no feature that lets you mask field elements of your document out of the Elasticsearch response.
That said, depending on your goal, you can look into how Highlighters and Suggesters identify result terms matching the query, or possibly, roll-your-own client-side masking using info returned from setting "explain": true in your query.
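If you go the client-side route, the masking itself is small. The Python sketch below filters the params array of each hit after the response comes back. The prefix rule (paramName starts with the search text) is only a stand-in for whatever your analyzer really matches, and mask_params is a made-up name.

```python
import copy


def mask_params(hit, text):
    """Return a copy of the hit whose params keep only matching items.

    Matching here means paramName starts with the search text,
    ignoring case; this is a hypothetical rule for illustration.
    """
    masked = copy.deepcopy(hit)  # leave the original response untouched
    source = masked["_source"]
    source["params"] = [
        param
        for param in source["params"]
        if param["paramName"].lower().startswith(text.lower())
    ]
    return masked


# The hit from the unwanted result in the question, trimmed.
hit = {
    "_id": "1",
    "_source": {
        "productName": "iphone 6",
        "params": [
            {"paramName": "color", "value": "white"},
            {"paramName": "capacity", "value": "32GB"},
        ],
    },
}

print(mask_params(hit, "col")["_source"]["params"])
# [{'paramName': 'color', 'value': 'white'}]
```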

Does an empty field in a document take up space in elasticsearch?

Does an empty field in a document take up space in elasticsearch?
For example, in the case below, is the total amount of space used to store the document the same in Case A as in Case B (assuming the field "colors" is defined in the mapping)?
Case A
{"features": {
"price": 1,
"colors": []
}}
Case B
{"features": {
"price": 1
}}
If you keep the default settings, the original document is stored in the _source field, so there will be a difference: the original document in case A is bigger than the one in case B.
Otherwise, there should be no difference: for case A, no term is added to the index for the colors field because it is empty.
You can use the _size field to see the size of the original indexed document, which is the size of the _source field:
POST stack
{
"mappings":{
"features":{
"_size": {"enabled":true, "store":true},
"properties":{
"price":{
"type":"byte"
},
"colors":{
"type":"string"
}
}
}
}
}
PUT stack/features/1
{
"price": 1
}
PUT stack/features/2
{
"price": 1,
"colors": []
}
POST stack/features/_search
{
"fields": [
"_size"
]
}
The last query will output this result, which shows that document 2 takes more space than document 1:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "features",
"_id": "1",
"_score": 1,
"fields": {
"_size": 16
}
},
{
"_index": "stack",
"_type": "features",
"_id": "2",
"_score": 1,
"fields": {
"_size": 32
}
}
]
}
}
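The same difference can be estimated before indexing by measuring the serialized JSON. The Python sketch below counts the UTF-8 bytes of a compact serialization; source_bytes is a hypothetical helper. The absolute numbers will not match the _size values above, which presumably reflect the request bodies exactly as they were sent, whitespace included.

```python
import json

case_a = {"price": 1, "colors": []}
case_b = {"price": 1}


def source_bytes(doc):
    """Bytes of the compact JSON form, a rough stand-in for the _source size."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


# The empty array still costs bytes in the stored source.
print(source_bytes(case_a), source_bytes(case_b))
```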

Resources