Is it possible to sort by a range in Elasticsearch? - sorting

When I execute the following query:
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"range": {
"my_range": {
"gt": 0,
"lte": 200
}
}
}
]
}
},
"sort": {
"my_range": {
"order": "asc",
"mode": "min"
}
}
}
I get the error:
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is not supported on field [my_range] of type [long_range]"
}
How can I enable a range datatype to be sortable? Is this possible?
Elasticsearch version: 5.4, but I am wondering if this is possible with ANY version.
More context
Not all documents in the alias/index have the range field. However, the query filters to only include documents with that field.

It is not straight-forward to sort using a field of range data type. Still you can use script based sorting to some extent to get the expected result.
e.g. For simplicity of script I'm assuming for all your docs, the data indexed against my_range field has data for gt and lte only and you want to sort based on the minimum values of the two then you can add the below for sorting:
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"range": {
"my_range": {
"gt": 0,
"lte": 200
}
}
}
]
}
},
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"inline": "Math.min(params['_source']['my_range']['gt'], params['_source']['my_range']['lte'])"
},
"order": "asc"
}
}
}
You can modify the script as per your needs for complex data involving combination of all lt, gt, lte, gte.
Updates (Scripts for other different use cases):
1. Sort by difference
"Math.abs(params['_source']['my_range']['gt'] - params['_source']['my_range']['lte'])"
2. Sort by gt
"params['_source']['my_range']['gt']"
3. Sort by lte
"params['_source']['my_range']['lte']"
4. Sorting if query returns few docs which don't have range field
"if(params['_source']['my_range'] != null) { <sorting logic> } else { return 0; }"
Replace <sorting logic> with the required logic of sorting (which can be one of the 3 above or the one in the query)
return 0 can be replace by return -1 or anything other number as per the sorting needs

I think what you are looking for is sort based on the difference of the range coz I'm not sure if simply sorting on any of the range values would make any sense.
For e.g. if range for one document is 100, 300 and another 200, 600 then you would want to sort based on the difference for e.g. you would want the lesser range to be appearing i.e 300-100 = 200 to be appearing at the top.
If so, I've made use of the below painless script and implemented script based sorting.
Sorting based on difference in Range
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte-params._source.my_range.gte"
},
"order":"asc"
}
}
}
Note that in this case, sort won't be based on any of the field values of my_range but only on their differences. If you want to further sort based on the fields like lte, lt, gte or gt you can have your sort implemented with multiple script as below:
Sorting based on difference in Range + Range Field (my_range.lte)
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":[
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte - params._source.my_range.gte"
},
"order":"asc"
}
},
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte"
},
"order":"asc"
}
}
]
}
So in this case even if for two documents, ranges are same, the one with the lesser my_range.lte would be showing up first.
Sort based on range field
However if you simply want to sort based on one of the range values, you can make use of below query.
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte"
},
"order":"asc"
}
}
}
Updated Answer to manage documents without range
This is for the scenario, Sort based on difference in range + Range.lte or Range.lt whichever is present
The below code what it does is,
Checks if the document has my_range field
If it doesn't have, then by default it would return Long.MAX_VALUE. This would mean if you sort by asc, this document should returned
last.
Further it would check if document has lte or lt and uses that value as high. Note that default value of high is Long.MAX_VALUE.
Similarly it would check if document has gte or gt and uses that value as low. Default value of low would be 0.
Calculate now high - low value on which sorting would be applied.
Updated Query
POST <your_index_name>/_search
{
"size":100,
"query":{
"match_all":{
}
},
"sort":[
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"""
if(params._source.my_range==null){
return Long.MAX_VALUE;
} else {
long high = Long.MAX_VALUE;
long low = 0L;
if(params._source.my_range.lte!=null){
high = params._source.my_range.lte;
} else if(params._source.my_range.lt!=null){
high = params._source.my_range.lt;
}
if(params._source.my_range.gte!=null){
low = params._source.my_range.gte;
} else if (params._source.my_range.gt==null){
low = params._source.my_range.gt;
}
return high - low;
}
"""
},
"order":"asc"
}
},
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"""
if(params._source.my_range==null){
return Long.MAX_VALUE;
}
long high = Long.MAX_VALUE;
if(params._source.my_range.lte!=null){
high = params._source.my_range.lte;
} else if(params._source.my_range.lt!=null){
high = params._source.my_range.lt;
}
return high;"""
},
"order":"asc"
}
}
]
}
This should work with ES 5.4. Hope it helps!

This can be resolved easily by using the regex interval filter :
Interval The interval option enables the use of numeric ranges,
enclosed by angle brackets "<>". For string: "foo80":
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the INTERVAL or ALL flags.
Elactic docs
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"regexp": {
"my_range": {
"value": "<0-200>"
}
}
}
]
}
},
"sort": {
"my_range": {
"order": "asc",
"mode": "min"
}
}
}

Related

ElasticSearch _knn_search query on multiple fields

I'm using ES 8.2. I'd like to use approximate method of _knn_search on more than 1 vector. Below I've attached my current code searching on a single vector. So far as I've read _knn_search does not support search on nested fields.
Alternatively, I can use multi index search. One index, one vector, one search, sum up all results together. However, I need to store all these vectors together in one index as I need also to perform filtration on some other fields besides vectors for knn search.
Thus, the question is if there is a work around how I can perform _knn_search on more than 1 vector?
search_vector = np.zeros(512).tolist()
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 100,
"num_candidates": 1000
},
"filter": [
{
"range": {
"feature_vector_1.match_prc": {
"gt": 10
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}
The last working query that I've end up with is
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 1000,
"num_candidates": 1000
},
"filter": [
{
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": """
double value = dotProduct(params.queryVector, 'feature_vector_2.vector');
return 100 * (1 + value) / 2;
""",
"params": {
"queryVector": search_vector
}
},
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}
However, it is not true AKNN on 2 vectors but still working option if performance of such query satisfies your expectations.
the below seems to be working for me for combining KNN searches, taking the average of multiple cosine similarity scores. Note that this is a little different than the original request, since it performs a brute force search, but you can still filter the results up front by replacing the match_all bit.
GET my-index/_search
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "(cosineSimilarity(params.vector1, 'my-vector1') + cosineSimilarity(params.vector2, 'my-vector2'))/2 + 1.0",
"params": {
"vector1": [
1.3012068271636963,
...
0.23468133807182312
],
"vector2": [
-0.49404603242874146,
...
-0.15835021436214447
]
}
}
}
}
}

Compute percentile with collapsing by user

Let says I have an index where I save a million of tweets (original object). I want to get the 90th percentile users based on the number of followers.
I know there is the aggregation "percentile" to do this, but my problem is that ElasticSearch use all documents so I have some users that tweet a lot who noise my calculation.
I want to isolate all unique user then compute the 90th.
The other constraint is that I want to do this in only one or two requests to keep the response lower than 500ms.
I have tried a lot of things and I was able to do this with "scripted_metric" but when my dataset exceed 100k of tweets the performances go down criticaly.
Any advice ?
Additionnal infos :
My index store orginal tweets & retweets based on user search queries
The index is mapped with a dynamic template mapping (No problem with this)
The index contains approximatly 100M
Unfortunately, "top hits" aggregation doesn't accept sub-aggs.
The request I try to achieve is :
{
"collapse": {
"field": "user.id" <--- I want this effect on aggregation
},
"query": {
"bool": {
"must": [
{
"term": {
"metadatas.clientId": {
"value": projectId
}
}
},
{
"match": {
"metadatas.blacklisted": false
}
}
],
"filter": [
{
"range": {
"publishedAt": {
"gte": "now-90d/d"
}
}
}
]
}
},
"aggs":{
"twitter": {
"percentiles": {
"field": "user.followers_count",
"percents": [95]
}
}
},
"size": 0
}
Finally, I figure out to find a workaround.
In percentile aggregation, I can use a script. I use params variable to hold unique keys then return preceding _score.
Without the complete explanation of the computation, I cannot fine tune the behavior of my script. But the result is good enough for me.
"aggs": {
"unique":{
"cardinality": {
"field": "collapse_profile"
}
},
"thresholds":{
"percentiles": {
"field": "user.followers_count",
"percents": [90],
"script": {
"source": """
if(params.keys == null){
params.keys = new HashMap();
}
def key = doc['user.id'].value;
def value = doc['user.followers_count'].value;
if(params.keys[key] == null){
params.keys[key] = _score;
return value;
}
return _score;
""",
"lang": "painless"
}
}
}
}

ElasticSearch 'bucket_script' not executing when one of the bucket paths resolves to 'null'

Let's assume there's a (simplified) index:
PUT test
{
"mappings": {
"properties": {
"numeric_field_sometimes_empty": {
"type": "integer"
},
"numeric_field_always_present": {
"type": "integer"
}
}
}
}
with numeric fields that may or may not be present in some (potentially all filtered) docs:
POST test/_doc
{
"numeric_field_always_present": 10
}
POST test/_doc
{
"numeric_field_always_present": 20
}
I'm wanting to execute a bucket_script to calculate certain trends and since bucket_script needs to be a child of a multi-bucket agg, I simulate one using filters. After that nothing stands in the way of creating my numeric single-bucket sub-aggs like so:
GET test/_search
{
"size": 0,
"aggs": {
"multibucket_simulator": {
"filters": {
"filters": {
"all": {
"match_all": {}
}
}
},
"aggs": {
"avg_empty": {
"avg": {
"field": "numeric_field_sometimes_empty"
}
},
"avg_non_null": {
"avg": {
"field": "numeric_field_always_present"
}
},
"diff": {
"bucket_script": {
"buckets_path": {
"now": "avg_empty.value",
"before": "avg_non_null.value"
},
"script": """
return (params.now != null ? params.now : 0)
- (params.before != null ? params.before : 0)
""",
"format": "###.##"
}
}
}
}
}
}
Since I know that some of those sub-aggs' results may be null (a strict null type, not 0), I check whether that was the case with the ternary operators and proceed to return the value difference. This yields:
{
"aggregations":{
"multibucket_simulator":{
"buckets":{
"all":{
"doc_count":2,
"avg_non_null":{
"value":15.0
},
"avg_empty":{
"value":null
}
}
}
}
}
}
with the diff bucket script sub-agg being fully left out. That's suboptimal...
I've tried removing .value from the paths so I can access and check .value directly in the script -- to no avail.
The question is then -- why are the null buckets being skipped and, also, is there an alternative besides bucket_script for this use case?
The docs state that
The specified metric must be numeric and the script must return a
numeric value.
I suppose it's a matter of discussion whether null falls into that category.
EDIT 1
With that being said, setting a missing parameter on each avg agg solves this problem.
EDIT 2 & Actual Anwer
It was the gap_policy. It defaults to skip and needed to be set to insert_zeros. Here's the reason.

Elasticsearch ordering by field value which is not in the filter

can somebody help me please to make a query which will order result items according some field value if this field is not part of query in request. I have a query:
{
"_source": [
"ico",
"name",
"city",
"status"
],
"sort": {
"_score": "desc",
"status": "asc"
},
"size": 20,
"query": {
"bool": {
"should": [
{
"match": {
"normalized": {
"query": "idona",
"analyzer": "standard",
"boost": 3
}
}
},
{
"term": {
"normalized2": {
"value": "idona",
"boost": 2
}
}
},
{
"match": {
"normalized": "idona"
}
}
]
}
}
}
The result is sorted according field status alphabetically ascending. Status contains few values like [active, canceled, old....] and I need something like boosting for every possible values in query. E.g. active boost 5, canceled boost 4, old boost 3 ........... Is it possible to do it? Thanks.
You would need a custom sort using script to achieve what you want.
I've just made use of generic match_all query for my query, you can probably go ahead and add your query logic there, but the solution that you are looking for is in the sort section of the below query.
Make sure that status is a keyword type
Custom Sorting Based on Values
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":[
{ "_score": "desc" },
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"if(params.scores.containsKey(doc['status'].value)) { return params.scores[doc['status'].value];} return 100000;",
"params":{
"scores":{
"active":5,
"old":4,
"cancelled":3
}
}
},
"order":"desc"
}
}
]
}
In the above query, go ahead and add the values in the scores section of the query. For e.g. if your value is new and you want it to be at say value 2, then your scores would be in the below:
{
"scores":{
"active":5,
"old":4,
"cancelled":3,
"new":6
}
}
So basically the documents would first get sorted by _score and then on that sorted documents, the script sort would be executed.
Note that the script sort is desc by nature as I understand that you would want to show active documents at the top, followed by other values. Feel free to play around with it.
Hope this helps!

Get result from aggs in script ElasticSearch/Painless

I'm new in ElasticSearch world. I've been trying write simple request and I need to get aggs result in my script to make simple condition. Is it possible to do it in this way?
The condition below is only for example.
GET _search
{
"aggs" : {
"sum_field" : { "sum" : { "field" : "someField" } }
},
"script_fields": {
"script_name": {
"script": {
"lang": "painless",
"source": """
// get there aggs result (sum_field)
if(sum_field > 5){
return sum_field
}
"""
}
}
}
}
The requirement is to execute sum aggregation over multiple indexes having the same field name
Now with multiple indexes, you'll have to check if that particular field exists in that indexes or not AND if the field is of the same datatype.
Indexes
I've created three indexes, having a single field called num.
index_1
- num: long
index_2
- num: long
index_3
- num: text
: fielddata: true
Also notice how if the field is of type text, then I've set its property fielddata:true. But if you do not set it, then the below query would give you aggregation result as well as an error saying you cannot retrieve the value of type text as its an analyzed string and you can only use doc for fields which are non_analyzed.
Sample Query:
POST /_search
{
"size":0,
"query":{
"bool":{
"filter":[
{
"exists":{
"field":"num"
}
}
]
}
},
"aggs":{
"myaggs":{
"sum":{
"script":{
"source":"if(doc['num'].value instanceof long) return doc['num'].value;"
}
}
}
}
}
Query if you cannot set fielddata:true
In that case, you need to explicitly mention the indexes on which you'd want to aggregate.
POST /_search
{
"size":0,
"query":{
"bool":{
"filter":[
{
"exists":{
"field":"num"
}
},
{
"terms":{
"_index":[
"index_1",
"index_2"
]
}
}
]
}
},
"aggs":{
"myaggs":{
"sum":{
"script":{
"source":"if(doc['num'].value instanceof long) return doc['num'].value;"
}
}
}
}
}
Hope this helps!

Resources