Is there a way to run a range query when the maximal range is defined by an array of two numbers? - Elasticsearch

I need to write an Elasticsearch range query that operates on the following document format:
...
"facetProperties": {
"fid641616": [
31.75,
44.45
]
}
...
The following query returns results only if lt or gt matches the lower or the upper bound of the max range. As soon as I try to narrow both ends, there are no results.
{
"query": {
"bool": {
"should": [{
"range": {
"facetProperties.fid641616": {
"gt": 33,
"lt": 42
}
}
}]
}
},
"from": 0,
"size": 250,
"sort": [
],
"aggs": {
},
"_source": "facetProperties.fid641616"
}
Is there a way to get this working without modifying the index?
update1 - some use cases:
query range:
"range": {
"facetProperties.fid641616": {
"gt": 33,
"lt": 42
}
}
facet1 : [31] - should not be found
facet2 : [31,45] - should be found
facet1 : [31,32] - should not be found
facet1 : [44,45] - should not be found

A single range clause matches only when one of the array values itself falls between gt and lt, so it cannot compare the queried interval with the span between the two numbers in the array. You can do that with a script query.
Below are a sample document and a script that should help you.
Sample Document:
POST range_index/_doc/1
{
"array": [31.75, 44.45]
}
Query:
POST range_index/_search
{
"query": {
"script": {
"script": {
"source": """
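List list = doc['array'];
if(list.size()==2){
// Doc values of a floating-point field are doubles, so do not store them in a long.
double first_number = list.get(0);
double last_number = list.get(1);
// The stored range must start at or below gt ...
if(params.gt < first_number)
return false;
// ... and end at or above lt.
if(params.lt > last_number)
return false;
if((last_number - first_number) >= (params.lt - params.gt))
return true;
}
return false;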
""",
"params": {
"gt": 33,
"lt": 42
}
}
}
}
}
The script returns only the documents whose two-value array spans the whole queried interval: the first value must not be greater than gt and the last value must not be less than lt.
You should see the sample document above in the result. Note that I'm assuming the values of the array field are in ascending order.
Once both bound checks pass, the stored range is at least as wide as the queried one (42 - 33 = 9), so the final width comparison is always satisfied.
Let me know if that helps!
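The check performed by the script can be mirrored in plain Python and run against the use cases listed in the question. This is only a sketch for reasoning about the logic: the function name contains_query_range is made up for illustration, and it assumes the stored array holds exactly two values in ascending order.

```python
def contains_query_range(values, gt, lt):
    """Return True when the stored [low, high] pair fully covers (gt, lt)."""
    # An array with anything other than two values cannot describe a range.
    if len(values) != 2:
        return False
    low, high = values
    # The stored range must start at or below gt and end at or above lt.
    return low <= gt and high >= lt


# Use cases from the question, checked against the query range (33, 42).
cases = {
    "single value": [31],
    "covers query": [31, 45],
    "below query": [31, 32],
    "above query": [44, 45],
}
results = {name: contains_query_range(v, gt=33, lt=42) for name, v in cases.items()}
```

Only the "covers query" case evaluates to True, which matches the expectations given in the question.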

Related

Compute percentile with collapsing by user

Let's say I have an index where I store a million tweets (original objects). I want to get the 90th percentile of users based on the number of followers.
I know there is the "percentiles" aggregation to do this, but my problem is that Elasticsearch uses all documents, so users who tweet a lot add noise to my calculation.
I want to isolate all unique users and then compute the 90th percentile.
The other constraint is that I want to do this in only one or two requests to keep the response time below 500 ms.
I have tried a lot of things and was able to do this with "scripted_metric", but when my dataset exceeds 100k tweets, performance drops critically.
Any advice?
Additional info:
My index stores original tweets and retweets based on user search queries
The index is mapped with a dynamic template mapping (no problem with this)
The index contains approximately 100M
Unfortunately, the "top_hits" aggregation doesn't accept sub-aggregations.
The request I try to achieve is :
{
"collapse": {
"field": "user.id" <--- I want this effect on aggregation
},
"query": {
"bool": {
"must": [
{
"term": {
"metadatas.clientId": {
"value": projectId
}
}
},
{
"match": {
"metadatas.blacklisted": false
}
}
],
"filter": [
{
"range": {
"publishedAt": {
"gte": "now-90d/d"
}
}
}
]
}
},
"aggs":{
"twitter": {
"percentiles": {
"field": "user.followers_count",
"percents": [95]
}
}
},
"size": 0
}
I finally found a workaround.
In the percentiles aggregation, I can use a script. I use the params variable to hold the unique keys: the first time a user is seen the script returns the follower count, and otherwise it returns _score.
Without a complete explanation of how the computation works, I cannot fine-tune the behavior of my script, but the result is good enough for me.
"aggs": {
"unique":{
"cardinality": {
"field": "collapse_profile"
}
},
"thresholds":{
"percentiles": {
"field": "user.followers_count",
"percents": [90],
"script": {
"source": """
if(params.keys == null){
params.keys = new HashMap();
}
def key = doc['user.id'].value;
def value = doc['user.followers_count'].value;
if(params.keys[key] == null){
params.keys[key] = _score;
return value;
}
return _score;
""",
"lang": "painless"
}
}
}
}
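For comparison, the computation the question is after (one follower count per unique user, then a percentile) can be written in plain Python. This is a sketch of the intended result, not of what Elasticsearch does internally: the function name followers_percentile and the sample data are invented for illustration, it uses the exact nearest-rank method, and the percentiles aggregation is approximate, so its numbers may differ.

```python
import math


def followers_percentile(tweets, percent):
    """Nearest-rank percentile of follower counts over unique users."""
    # Keep one follower count per user id; later tweets overwrite earlier ones.
    followers_by_user = {}
    for tweet in tweets:
        user = tweet["user"]
        followers_by_user[user["id"]] = user["followers_count"]
    counts = sorted(followers_by_user.values())
    if not counts:
        return None
    # Nearest rank: smallest value with at least percent% of users at or below it.
    rank = max(1, math.ceil(percent / 100 * len(counts)))
    return counts[rank - 1]


# User "a" tweets three times but must count only once.
tweets = [
    {"user": {"id": "a", "followers_count": 10}},
    {"user": {"id": "a", "followers_count": 10}},
    {"user": {"id": "a", "followers_count": 10}},
    {"user": {"id": "b", "followers_count": 200}},
    {"user": {"id": "c", "followers_count": 3000}},
]
p50 = followers_percentile(tweets, 50)
p90 = followers_percentile(tweets, 90)
```

Without the deduplication step the median of this sample would be 10, because the frequent poster dominates; with it, the median is taken over three users.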

Query return the search difference on elasticsearch

How would the following query look:
Scenario:
I have two bases (base 1 and 2), with 1 column each, I would like to see the difference between them, that is, what exists in base 1 that does not exist in base 2, considering the fictitious names of the columns as hostname.
Example:
Does the selected value of Base1.Hostname exist in Base2.Hostname?
YES → DO NOT RETURN
NO → RETURN
I have this in Python as the following function:
def diff(first, second):
    # Convert to a set once so each membership test is O(1).
    second = set(second)
    return [item for item in first if item not in second]
Example match equal:
GET /base1/_search
{
"query": {
"multi_match": {
"query": "webserver",
"fields": [
"hostname"
],
"type": "phrase"
}
}
}
I would like to migrate this architecture to Elasticsearch in order to generate forecasts in the future from how frequently these search results change in the bases.
This could be done with an aggregation:
1. Collect all the hostnames from the base1 and base2 indices
2. For each hostname, count its occurrences in base2
3. Keep only the buckets whose base2 count is 0
GET base*/_search
{
"size": 0,
"aggs": {
"all": {
"composite": {
"size": 10,
"sources": [
{
"host": {
"terms": {
"field": "hostname"
}
}
}
]
},
"aggs": {
"base2": {
"filter": {
"match": {
"_index": "base2"
}
}
},
"index_count_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"base2_count": "base2._count"
},
"script": "params.base2_count == 0"
}
}
}
}
}
}
Don't forget to paginate to get the rest of the results: pass the after_key returned in each response as the after parameter of the next composite request.
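A pagination loop around the query above could look like the following sketch. The search argument is a hypothetical callable that sends a request body to GET base*/_search and returns the parsed JSON response; it is not part of any client library. The loop assumes that the response stops including after_key once the composite aggregation is exhausted, and it deliberately does not stop on an empty page, because bucket_selector may filter out every bucket of a page.

```python
import copy


def missing_in_base2(search, page_size=10):
    """Yield hostnames whose base2 count is 0, following composite pagination."""
    body = {
        "size": 0,
        "aggs": {"all": {
            "composite": {
                "size": page_size,
                "sources": [{"host": {"terms": {"field": "hostname"}}}],
            },
            "aggs": {
                "base2": {"filter": {"match": {"_index": "base2"}}},
                "index_count_bucket_filter": {"bucket_selector": {
                    "buckets_path": {"base2_count": "base2._count"},
                    "script": "params.base2_count == 0",
                }},
            },
        }},
    }
    while True:
        agg = search(copy.deepcopy(body))["aggregations"]["all"]
        for bucket in agg["buckets"]:
            yield bucket["key"]["host"]
        after_key = agg.get("after_key")
        if after_key is None:
            break  # no more pages
        # Resume the next request right after the last composite key seen.
        body["aggs"]["all"]["composite"]["after"] = after_key


# Canned responses standing in for a real cluster (hypothetical data).
pages = [
    {"aggregations": {"all": {"after_key": {"host": "b"},
                              "buckets": [{"key": {"host": "a"}, "doc_count": 1}]}}},
    {"aggregations": {"all": {"after_key": {"host": "d"}, "buckets": []}}},
    {"aggregations": {"all": {"buckets": []}}},
]
calls = []


def fake_search(body):
    # Record which "after" value each request carried.
    calls.append(body["aggs"]["all"]["composite"].get("after"))
    return pages[len(calls) - 1]


found = list(missing_in_base2(fake_search))
```

The canned second page is empty after filtering yet still carries an after_key, which is why the loop keys its termination on after_key rather than on the bucket list.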
References :
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
https://discuss.elastic.co/t/data-set-difference-between-fields-on-different-indexes/160015/4

Is it possible to sort by a range in Elasticsearch?

When I execute the following query:
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"range": {
"my_range": {
"gt": 0,
"lte": 200
}
}
}
]
}
},
"sort": {
"my_range": {
"order": "asc",
"mode": "min"
}
}
}
I get the error:
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is not supported on field [my_range] of type [long_range]"
}
How can I enable a range datatype to be sortable? Is this possible?
Elasticsearch version: 5.4, but I am wondering if this is possible with ANY version.
More context
Not all documents in the alias/index have the range field. However, the query filters to only include documents with that field.
Sorting on a field of a range data type is not straightforward. Still, you can use script-based sorting to get the expected result to some extent.
For example, to keep the script simple I'm assuming that in all your documents the my_range field is indexed with gt and lte only, and that you want to sort on the smaller of the two values. You can then sort as follows:
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"range": {
"my_range": {
"gt": 0,
"lte": 200
}
}
}
]
}
},
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"inline": "Math.min(params['_source']['my_range']['gt'], params['_source']['my_range']['lte'])"
},
"order": "asc"
}
}
}
You can modify the script as needed for more complex data involving any combination of lt, gt, lte and gte.
Updates (scripts for other use cases):
1. Sort by difference
"Math.abs(params['_source']['my_range']['gt'] - params['_source']['my_range']['lte'])"
2. Sort by gt
"params['_source']['my_range']['gt']"
3. Sort by lte
"params['_source']['my_range']['lte']"
4. Sorting when the query returns some docs that don't have the range field
"if(params['_source']['my_range'] != null) { <sorting logic> } else { return 0; }"
Replace <sorting logic> with the required logic of sorting (which can be one of the 3 above or the one in the query)
return 0 can be replaced by return -1 or any other number, depending on the sorting needs.
I think what you are looking for is a sort based on the width of the range, because I'm not sure that simply sorting on one of the range values would make sense.
For example, if the range for one document is 100 to 300 and for another 200 to 600, you would want the narrower range (300 - 100 = 200) to appear at the top.
If so, I've made use of the below painless script and implemented script based sorting.
Sorting based on difference in Range
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte-params._source.my_range.gte"
},
"order":"asc"
}
}
}
Note that in this case, sort won't be based on any of the field values of my_range but only on their differences. If you want to further sort based on the fields like lte, lt, gte or gt you can have your sort implemented with multiple script as below:
Sorting based on difference in Range + Range Field (my_range.lte)
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":[
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte - params._source.my_range.gte"
},
"order":"asc"
}
},
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte"
},
"order":"asc"
}
}
]
}
So in this case even if for two documents, ranges are same, the one with the lesser my_range.lte would be showing up first.
Sort based on range field
However if you simply want to sort based on one of the range values, you can make use of below query.
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte"
},
"order":"asc"
}
}
}
Updated Answer to manage documents without range
This is for the scenario: sort based on the difference in range, then on Range.lte or Range.lt, whichever is present.
The code below does the following:
1. Checks whether the document has the my_range field.
2. If it doesn't, returns Long.MAX_VALUE by default, so with an asc sort this document is returned last.
3. Checks whether the document has lte or lt and uses that value as high. The default value of high is Long.MAX_VALUE.
4. Similarly, checks whether the document has gte or gt and uses that value as low. The default value of low is 0.
5. Calculates high - low, on which the sort is applied.
Updated Query
POST <your_index_name>/_search
{
"size":100,
"query":{
"match_all":{
}
},
"sort":[
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"""
if(params._source.my_range==null){
return Long.MAX_VALUE;
} else {
long high = Long.MAX_VALUE;
long low = 0L;
if(params._source.my_range.lte!=null){
high = params._source.my_range.lte;
} else if(params._source.my_range.lt!=null){
high = params._source.my_range.lt;
}
if(params._source.my_range.gte!=null){
low = params._source.my_range.gte;
} else if (params._source.my_range.gt!=null){
low = params._source.my_range.gt;
}
return high - low;
}
"""
},
"order":"asc"
}
},
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"""
if(params._source.my_range==null){
return Long.MAX_VALUE;
}
long high = Long.MAX_VALUE;
if(params._source.my_range.lte!=null){
high = params._source.my_range.lte;
} else if(params._source.my_range.lt!=null){
high = params._source.my_range.lt;
}
return high;"""
},
"order":"asc"
}
}
]
}
This should work with ES 5.4. Hope it helps!
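The two sort scripts can be checked offline with a small Python model of the same rules. This is a sketch under assumptions: range_sort_key and the sample documents are made up for illustration, LONG_MAX stands in for Java's Long.MAX_VALUE, and Python integers do not overflow the way a Java long would.

```python
LONG_MAX = 2**63 - 1  # stand-in for Java's Long.MAX_VALUE


def range_sort_key(source):
    """Return (width, high) so documents sort by range width, then upper bound."""
    my_range = source.get("my_range")
    if my_range is None:
        # Documents without the field sort last in ascending order.
        return (LONG_MAX, LONG_MAX)
    # Prefer lte over lt and gte over gt, falling back to the defaults.
    high = my_range.get("lte", my_range.get("lt", LONG_MAX))
    low = my_range.get("gte", my_range.get("gt", 0))
    return (high - low, high)


docs = [
    {"id": 1, "my_range": {"gte": 200, "lte": 600}},
    {"id": 2},
    {"id": 3, "my_range": {"gt": 100, "lt": 300}},
    {"id": 4, "my_range": {"gte": 0, "lte": 200}},
]
ordered = [d["id"] for d in sorted(docs, key=range_sort_key)]
```

Documents 3 and 4 have the same width (200), so the second key (the upper bound) decides their order, and the document without my_range comes last.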
This can be resolved easily by using the interval option of the regexp query:
Interval The interval option enables the use of numeric ranges,
enclosed by angle brackets "<>". For string: "foo80":
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the INTERVAL or ALL flags.
Elastic docs
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"regexp": {
"my_range": {
"value": "<0-200>"
}
}
}
]
}
},
"sort": {
"my_range": {
"order": "asc",
"mode": "min"
}
}
}

ElasticSearch max score

I'm trying to solve a performance issue we have when querying ElasticSearch for several thousand results. The basic idea is that we do some post-query processing and only show the Top X results ( Query may have ~100000 Results while we only need the top 100 according to our Score Mechanics ).
The basic mechanics are as follows:
ElasticSearch Score is normalized between 0..1 ( score/max(score) ), we add our ranking score ( also normalized between 0..1 ) and divide by 2.
What I'd like to do is move this logic into ElasticSearch using custom scoring ( or well, anything that works ): https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score
The Problem I'm facing is that using Score Scripts / Score Functions I can't seem to find a way to do something like max(_score) to normalize the score between 0 and 1.
"script_score" : {
"script" : "(_score / max(_score) + doc['some_normalized_field'].value)/2"
}
Any ideas are welcome.
You cannot get max_score before the _score has actually been computed for all the matching documents. A script_score query first computes the _score of every matching document, and only then does Elasticsearch report max_score.
As far as I understand your problem, you want to preserve the max_score produced by the original query, before script_score is applied. You can get the required result by doing part of the computation on the front end: apply your formula there and then sort the results.
You can store your factor in the results using script_fields.
{
"explain": true,
"query": {
"match_all": {}
},
"script_fields": {
"total_goals": {
"script": {
"lang": "painless",
"source": """
int total = 0;
for (int i = 0; i < doc['goals'].length; ++i) {
total += doc['goals'][i];
}
return total;
""",
"params":{
"last" : "any parameters required"
}
}
}
}
}
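The front-end step could be sketched as follows, using the formula from the question: (score / max(score) + ranking) / 2. The hit layout assumes a standard search response in which a script field is returned as a list under "fields"; the script field name ranking, the function name rerank and the sample hits are hypothetical.

```python
def rerank(hits, top_n=100):
    """Combine the normalized Elasticsearch score with a ranking score."""
    if not hits:
        return []
    max_score = max(hit["_score"] for hit in hits)
    combined = []
    for hit in hits:
        # Guard against a max_score of 0 (for example, filter-only queries).
        es_part = hit["_score"] / max_score if max_score else 0.0
        # Script fields come back as lists; take the single value.
        ranking = hit["fields"]["ranking"][0]
        combined.append(((es_part + ranking) / 2, hit["_id"]))
    # Highest combined score first, then keep only the top results.
    combined.sort(key=lambda pair: pair[0], reverse=True)
    return combined[:top_n]


hits = [
    {"_id": "a", "_score": 4.0, "fields": {"ranking": [0.2]}},
    {"_id": "b", "_score": 2.0, "fields": {"ranking": [1.0]}},
    {"_id": "c", "_score": 1.0, "fields": {"ranking": [0.5]}},
]
top_ids = [doc_id for _, doc_id in rerank(hits, top_n=2)]
```

Here document "b" overtakes "a" despite a lower text score, because its ranking score is much higher. Note that this still requires fetching every candidate hit, so it moves the formula but not the data volume.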
I am not sure that I understand your question. Do you want to limit the number of results?
Have you tried this?
{
"from" : 0, "size" : 10,
"query" : {
"term" : { "name" : "dennis" }
}
}
You can use sort to define the sort order; by default the results are sorted by the score of the main query.
You can also use aggregations (with or without function_score):
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"date": {
"scale": "3d",
"offset": "7d",
"decay": 0.1
}
}
},
{
"gauss": {
"priority": {
"origin": "0",
"scale": "100"
}
}
}
],
"query": {
"match" : { "body" : "dennis" }
}
}
},
"aggs": {
"hits": {
"top_hits": {
"size": 10
}
}
}
}
According to this GitHub ticket, it is simply impossible to normalize the score; they suggest using boolean similarity as a workaround.

To find difference between two integer fields and check it falls under a specific range, using scripts in elasticsearch

I have two fields in my documents, let us call them "fieldA" and "fieldB". I need to find the difference between them, check whether that value falls within a specific range, say "rangeA" or "rangeB", and then return the documents that match my criteria.
The schema for data is as shown below:
{
"fieldA": 45,
"fieldB": 13
}
I need to find all the documents in which the difference between "fieldA" and "fieldB" is between 30 and 35. How can I do this using scripting in Elasticsearch?
This can also be done using aggregations and scripts, as below:
{
"aggregations": {
"age_diff": {
"range": {
"script": "doc[\"fieldA\"].value - doc[\"fieldB\"].value",
"ranges": [
{
"from": 30,
"to": 35
}
]
}
}
}
}
This way you can only check how many documents fall within the specified range. If you want to get the documents in each bucket, you can use a "top_hits" sub-aggregation.
More detailed discussion on aggregations can be found here and more about "top_hits" can be found in detail here
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "difference=doc['fieldA'].value-doc['fieldB'].value;return (difference>param1 && difference<param2);",
"params": {
"param1":30,
"param2":35
}
}
}
}
}
}
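The filtered query used above was removed in Elasticsearch 5.0, so on newer versions the same filter would have to be expressed differently. The sketch below builds an equivalent request body with a bool filter and a Painless script query. It assumes a version that accepts the "source" key for inline scripts, and the helper names difference_filter and matches are made up for illustration; matches is only a plain-Python mirror of the script for checking the boundaries.

```python
def difference_filter(low, high):
    """Build a search body keeping docs with low < fieldA - fieldB < high."""
    source = (
        "def difference = doc['fieldA'].value - doc['fieldB'].value; "
        "return difference > params.low && difference < params.high;"
    )
    return {
        "query": {"bool": {"filter": {"script": {"script": {
            "lang": "painless",
            "source": source,
            "params": {"low": low, "high": high},
        }}}}}
    }


def matches(doc, low, high):
    """Plain-Python mirror of the script: both bounds are exclusive."""
    difference = doc["fieldA"] - doc["fieldB"]
    return low < difference < high


body = difference_filter(30, 35)
sample_matches = matches({"fieldA": 45, "fieldB": 13}, 30, 35)  # 45 - 13 = 32
```

Both bounds are exclusive, as in the original script, so a difference of exactly 30 or 35 is not returned.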
