How to do arithmetic in buckets with nested data in Elasticsearch - elasticsearch

I have a concept I want to write a query for in Elasticsearch, but I can't figure out how from the documentation.
Suppose I have documents in an index that look like this, with nested "owners". (I'm omitting the quotation marks to ease my typing)
[
  {
    id: 1,
    cost: 8.50,
    owners: [
      { ownerId: 11, share: 0.45 },
      { ownerId: 12, share: 0.55 }
    ]
  },
  {
    id: 2,
    cost: 12.00,
    owners: [
      { ownerId: 11, share: 1.0 }
    ]
  },
  ...
]
I'd like an aggregation that multiplies cost by the owner's share, buckets the results by owner ID, and then sums them up. So if I only had the two documents shown above, I'd get an aggregation with a bucket for ownerId 11 with a cost sum of 15.825 and one for ownerId 12 with a cost sum of 4.675. It seems like this should be possible, but how?
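One hedged sketch of an approach, assuming owners is mapped as a nested field and, as far as I know, a script inside a nested aggregation only sees the nested documents' own fields, so the weighted cost is precomputed at index time into each owner entry (owners.costShare is a field name I'm introducing for illustration; my-index is a hypothetical index name):

POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "owners": {
      "nested": { "path": "owners" },
      "aggs": {
        "by_owner": {
          "terms": { "field": "owners.ownerId" },
          "aggs": {
            "total_cost": {
              "sum": { "field": "owners.costShare" }
            }
          }
        }
      }
    }
  }
}

With costShare stored as cost * share (3.825 and 4.675 for document 1, 12.0 for document 2), the by_owner buckets would sum to 15.825 for ownerId 11 and 4.675 for ownerId 12, matching the figures above.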

Related

Reorder object hierarchy and group by time in JSONata

Although I'm not a total JSONata noob, I'm having a hard time finding an elegant solution to the following desired transformation. The starting point is a set of time-series data in a format like this:
{
  "series1": {
    "data": [
      {"time": "2022-01-01T00:00:00Z", "value": 22},
      {"time": "2022-01-02T00:00:00Z", "value": 23}
    ]
  },
  "series2": {
    "data": [
      {"time": "2022-01-01T00:00:00Z", "value": 220},
      {"time": "2022-01-02T00:00:00Z", "value": 230}
    ]
  }
}
I need to "flip the hierarchy" and group these datapoints by timestamp, into an array of objects, as follows:
[
  {
    "time": "2022-01-01T00:00:00Z",
    "series1": 22,
    "series2": 220
  },
  {
    "time": "2022-01-02T00:00:00Z",
    "series1": 23,
    "series2": 230
  }
]
I currently have this working with the expression
$each($, function($v, $s) {
  [$v.data.{
    'series': $s,
    'time': $.time,
    'value': $.value
  }]
}).*{
  `time`: {
    `series`: value
  }
}
~> $each(function($v, $t) {
  $merge([
    $v,
    {'time': $t}
  ])
})
(playground link: https://try.jsonata.org/8CaggujJk)
...and...I can't help but feel that there must be a better way!
For reference, my current expression basically does this in three consecutive steps (intermediate shapes sketched below):
The first $each() function splits the original object up into an array of datapoints, each carrying a series name, timestamp, and value.
A grouping operator makes time a key and gathers all values for a given timestamp together.
A second $each() function transforms the object back into an array of objects where time is a value rather than a key, merging the time key-value pair alongside the series values.
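For illustration only (this is my reading of the expression, not verified output), the intermediate shapes after the first two steps would look roughly like this.

After step 1, a flat array of datapoints:
[
  { "series": "series1", "time": "2022-01-01T00:00:00Z", "value": 22 },
  { "series": "series1", "time": "2022-01-02T00:00:00Z", "value": 23 },
  { "series": "series2", "time": "2022-01-01T00:00:00Z", "value": 220 },
  { "series": "series2", "time": "2022-01-02T00:00:00Z", "value": 230 }
]

After step 2, an object keyed by timestamp:
{
  "2022-01-01T00:00:00Z": { "series1": 22, "series2": 220 },
  "2022-01-02T00:00:00Z": { "series1": 23, "series2": 230 }
}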
I've seen some wonderfully elegant solutions to similar problems on here, but am not sure how to approach this in a better way. Any tips appreciated!

Is there a way to create a runtime field in Elasticsearch that is equal to a 'Value'/'Sum of Value across index'?

I have a task to show the percentage of value that a set of filtered documents represents versus the entire value across a whole year. For example:
[{
  name: 'Foo',
  value: 12,
  year: 2021
},
{
  name: 'Bar',
  value: 2,
  year: 2021
},
{
  name: 'Car',
  value: 10,
  year: 2021
},
{
  name: 'Lar',
  value: 4,
  year: 2022
}]
I'd like to create a runtime field that would equal .5 for 'Foo' (12/(12+2+10)), .42 for 'Car' (10/(12+2+10)) and 1 for 'Lar' (4/4). Is this possible? Is there a better way to achieve this result? The ultimate goal is that if someone creates a query that returns 'Foo' and 'Car' they could sum the runtime field to get .92 (.5+.42) and that such a result could be used in a Kibana Lens visualization.
I've tried creating queries that return the above results, and that is easy enough, but those queries aren't usable inside Kibana, which also has global filters to account for. That's why I thought a calculated field representing the ratio of a document's value to the sum of all documents' values would be useful.
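As far as I know, a runtime field is computed per document and cannot see other documents, so the denominator would have to come from an aggregation rather than a field. Outside of Lens, one hedged sketch (my-index is a hypothetical index name, and this assumes name is indexed as a keyword) sums value per name inside each year bucket:

POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_year": {
      "terms": { "field": "year" },
      "aggs": {
        "year_total": { "sum": { "field": "value" } },
        "by_name": {
          "terms": { "field": "name" },
          "aggs": {
            "name_total": { "sum": { "field": "value" } }
          }
        }
      }
    }
  }
}

Dividing each name_total by its parent bucket's year_total reproduces the ratios in the example (12/24 = .5 for 'Foo', 10/24 ≈ .42 for 'Car', 4/4 = 1 for 'Lar'); whether that division can be expressed as a Lens formula under the global filters is a separate question, so the division is left outside the query here.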

Sum two fields in a nested array in RethinkDB

The following document exists in a table in RethinkDB:
{
  u'destination_addresses': [
    u'1 Rockefeller Plaza, New York, NY 10020, USA',
    u'Meadowlands, PA 15301, USA'
  ],
  u'origin_addresses': [
    u'1600 Pennsylvania Ave SE, Washington, DC 20003, USA'
  ],
  u'rows': [
    {
      u'elements': [
        {
          u'distance': {
            u'text': u'288 mi',
            u'value': 464087
          },
          u'duration': {
            u'text': u'5 hours 2 mins',
            u'value': 18142
          },
          u'status': u'OK'
        },
        {
          u'distance': {
            u'text': u'266 mi',
            u'value': 428756
          },
          u'duration': {
            u'text': u'4 hours 6 mins',
            u'value': 14753
          },
          u'status': u'OK'
        }
      ]
    }
  ],
  u'status': u'OK'
}
I am trying to sum the 'value' field for both duration and distance (that is, getting the total distance and duration for a given trip; each of these documents is a response from the Google Maps Distance API). I have tried a great many combinations of pluck (from the nested fields documentation) but cannot seem to get this working. I'm working in Python; thanks in advance for any help.
Does this do what you want?
document['rows'].concat_map(lambda row: row['elements'])['distance']['value'].sum()
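Building on that answer, a minimal sketch that sums both fields, assuming the classic rethinkdb Python driver and a hypothetical table trips holding these documents:

import rethinkdb as r

conn = r.connect('localhost', 28015)

trip_id = 'some-document-id'  # hypothetical document id

# Flatten rows -> elements once, then sum each nested 'value' field.
elements = r.table('trips').get(trip_id)['rows'].concat_map(
    lambda row: row['elements'])

total_distance = elements['distance']['value'].sum().run(conn)  # metres
total_duration = elements['duration']['value'].sum().run(conn)  # seconds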

Boosting only results with a near-identical score in Elasticsearch

I'm using the following query to search through a database of names, allowing fuzzy matching but giving preference to exact matches.
"query": {
  "bool": {
    "should": [
      {
        "match": {
          "name": {
            "query": "x",
            "operator": "and",
            "boost": 10
          }
        }
      },
      {
        "match": {
          "name": {
            "query": "x",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      },
      {
        "match": {
          "altname": {
            "query": "x",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      }
    ]
  }
}
The database contains entries with identical names. When that happens, I would like to boost those entries by a second field, let's call it weight. However, I only want the boost to be applied within the subset of results with a (near) identical score, not to all of the results.
This is further complicated by the fact that results with an identical name may receive a slightly different score, as they are influenced by the relevancy on the altname field.
For example, querying for dog could give 3 results:
Dog [id 1, score 2.3, weight 10]
Dog [id 2, score 2.2, weight 20]
Doge [id 3, score 1, weight 100]
I'm looking for a query that would boost the result with id 2 to the top score. The result with id 3 should always stay at the bottom due to its poor relevancy, regardless of its weight. Ideally with tunable parameters to tweak the factor of the score vs. the factor of the weight.
Any way to do this in a single pass in Elasticsearch, of course without ruining performance?
Looks like I figured it out.
First, I realised that the example in my original question was more complex than necessary. I narrowed it down to: "How do I compose a query for 'blub' that returns the following documents in the order 2, 3, 1?"
id: 1
name: blub
weight: 0.01
---
id: 2
name: blub
weight: 0.1
---
id: 3
name: blub stuff
weight: 1
Thus: for the two documents with an identical (or very similar) score, the weight should be used as a tie-breaker. But documents with a significantly lower score should never be allowed to trump other results, regardless of their weight.
I loaded the data into the excellent Play tool (https://www.found.no/play/gist/edd93c69c015d4c62366#search) and started experimenting.
It turned out the log2p modifier did exactly what I wanted. I repeated it on a real-world dataset and everything looks exactly as expected.
{
  "query": {
    "function_score": {
      "query": {
        "match": { "name": "blub" }
      },
      "field_value_factor": {
        "field": "weight",
        "modifier": "log2p"
      }
    }
  }
}
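For anyone wondering why this works, my understanding (based on the field_value_factor documentation, and assuming the default multiply boost_mode) is that log2p multiplies the relevancy score by the common logarithm of (2 + weight). With the weights above that gives factors of roughly log10(2.01) ≈ 0.303, log10(2.1) ≈ 0.322 and log10(3) ≈ 0.477, which is enough to break the tie between the two identically scored "blub" documents while, in this example, leaving the much lower-scoring "blub stuff" document at the bottom.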

ElasticSearch custom score script does not preserve array ordering

I am using ElasticSearch with a function_score property to retrieve documents sorted by createdOn. The createdOn field is stored as an Array representing date values, i.e.
"createdOn": [ 2014, 4, 24, 22, 11, 47 ]
where createdOn[0] is the year, createdOn[1] is the month, createdOn[2] is the day, etc. I am testing the following query, which should return documents scored by year. However, the doc['createdOn'] array does not preserve the order of the elements: in this query, doc['createdOn'].values[0] returns 4, not 2014.
POST /example/1
{
  "name": "apple",
  "createdOn": [2014, 8, 22, 5, 12, 32]
}

POST /example/2
{
  "name": "apple",
  "createdOn": [2011, 8, 22, 5, 12, 32]
}

POST /example/3
{
  "name": "apple",
  "createdOn": [2013, 8, 22, 5, 12, 32]
}

POST /example/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": "doc['createdOn'].values[0]"
      }
    }
  }
}
It appears that this is due to the way ElasticSearch caches data (http://elasticsearch-users.115913.n3.nabble.com/Accessing-array-field-within-Native-Plugin-td4042848.html):

The docFieldDoubles method gets its values from the in-memory structures of the field data cache. This is done for performance. The field data cache is not loaded from the source of the document (because this would be slow) but from the Lucene index, where the values are sorted (for lookup speed). The get API does work based on the original document source, which is why you see those values in order (note: ES doesn't parse the source for the get API, it just gives you back what you put in it).
You can access the original document (which will be parsed) using the SourceLookup (available from the source method), but it will be slow as it needs to go to disk for every document.
I'm not sure about the exact semantics of what you are trying to achieve, but did you try looking at nested objects? Those allow you to store a list of objects in a way that keeps values together, like [{"key": "k1", "value": "v1"}, ...].

The only apparent solution, other than using the source method (which is slow), is to use nested queries. Any ideas on how I could rewrite my query using nested queries? It seems like the only efficient way to sort this query by year.

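One possible workaround, offered only as a sketch rather than an accepted answer: if the year is all the scoring needs, index it as its own numeric field alongside the array so the script never depends on array ordering. The createdYear field name below is something I'm introducing for illustration, mirroring the query from the question:

POST /example/4
{
  "name": "apple",
  "createdYear": 2014,
  "createdOn": [2014, 8, 22, 5, 12, 32]
}

POST /example/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "query": { "match_all": {} },
      "script_score": {
        "script": "doc['createdYear'].value"
      }
    }
  }
}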