Sum two fields in a nested array in RethinkDB - rethinkdb

The following document exists in a table in RethinkDB:
{
  "destination_addresses": [
    "1 Rockefeller Plaza, New York, NY 10020, USA",
    "Meadowlands, PA 15301, USA"
  ],
  "origin_addresses": [
    "1600 Pennsylvania Ave SE, Washington, DC 20003, USA"
  ],
  "rows": [
    {
      "elements": [
        {
          "distance": {"text": "288 mi", "value": 464087},
          "duration": {"text": "5 hours 2 mins", "value": 18142},
          "status": "OK"
        },
        {
          "distance": {"text": "266 mi", "value": 428756},
          "duration": {"text": "4 hours 6 mins", "value": 14753},
          "status": "OK"
        }
      ]
    }
  ],
  "status": "OK"
}
I am trying to sum the 'value' field for both duration and distance (so, getting the total distance and duration for a given trip; each of these documents comes from the Google Maps Distance Matrix API). I have tried a great many combinations of pluck (from the nested fields documentation) but cannot seem to get this working. I'm working in Python. Thanks in advance for any help.

Does this do what you want?
document['rows'].concat_map(lambda row: row['elements'])['distance']['value'].sum()
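For a quick sanity check, the same concat_map-then-sum logic can be run in plain Python against the sample document (a sketch of the logic, not the ReQL query itself):

```python
# Plain-Python check of the query's logic against the sample document above.
doc = {
    "rows": [
        {
            "elements": [
                {"distance": {"value": 464087}, "duration": {"value": 18142}},
                {"distance": {"value": 428756}, "duration": {"value": 14753}},
            ]
        }
    ]
}

# concat_map flattens the rows into one list of elements.
elements = [e for row in doc["rows"] for e in row["elements"]]

total_distance = sum(e["distance"]["value"] for e in elements)  # 892843
total_duration = sum(e["duration"]["value"] for e in elements)  # 32895
```

The duration total works the same way in ReQL, substituting 'duration' for 'distance' in the expression above.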


Reorder object hierarchy and group by time in JSONata

Although I'm not a total JSONata noob, I'm having a hard time finding an elegant solution to the following desired transformation. The starting point is a set of time-series data in a format like this:
{
  "series1": {
    "data": [
      {"time": "2022-01-01T00:00:00Z", "value": 22},
      {"time": "2022-01-02T00:00:00Z", "value": 23}
    ]
  },
  "series2": {
    "data": [
      {"time": "2022-01-01T00:00:00Z", "value": 220},
      {"time": "2022-01-02T00:00:00Z", "value": 230}
    ]
  }
}
I need to "flip the hierarchy", and group these datapoints by timestamp, into an array of objects, like follows:
[
  {
    "time": "2022-01-01T00:00:00Z",
    "series1": 22,
    "series2": 220
  },
  {
    "time": "2022-01-02T00:00:00Z",
    "series1": 23,
    "series2": 230
  }
]
I currently have this working with the expression
$each($, function($v, $s) {
  [$v.data.{
    'series': $s,
    'time': $.time,
    'value': $.value
  }]
}).*{
  `time`: {
    `series`: value
  }
}
~> $each(function($v, $t) {
  $merge([
    $v,
    {'time': $t}
  ])
})
(playground link: https://try.jsonata.org/8CaggujJk)
...and...I can't help but feel that there must be a better way!
For reference, my current expression basically does this in three consecutive steps:
1. The first $each() function splits the original object into an array of datapoints, each carrying a series name, timestamp, and value.
2. A grouping operator makes time a key and gathers all values for a given timestamp together.
3. A second $each() function transforms the object back into an array of objects where time is a value again, rather than a key, and merges the time key-value alongside the series values.
I've seen some wonderfully elegant solutions to similar problems on here, but am not sure how to approach this in a better way. Any tips appreciated!
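In the meantime, the intended transformation can be pinned down with a plain-Python sketch (not JSONata, and not necessarily more elegant — just a reference implementation of the hierarchy flip):

```python
from collections import defaultdict

data = {
    "series1": {"data": [{"time": "2022-01-01T00:00:00Z", "value": 22},
                         {"time": "2022-01-02T00:00:00Z", "value": 23}]},
    "series2": {"data": [{"time": "2022-01-01T00:00:00Z", "value": 220},
                         {"time": "2022-01-02T00:00:00Z", "value": 230}]},
}

# Group every datapoint's value under its timestamp, keyed by series name.
by_time = defaultdict(dict)
for series, body in data.items():
    for point in body["data"]:
        by_time[point["time"]][series] = point["value"]

# Turn the timestamp keys back into values, one object per timestamp.
result = [{"time": t, **values} for t, values in sorted(by_time.items())]
```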

Binning Data With Two Timestamps

I'm posting because I have found no content surrounding this topic.
My goal is essentially to produce a time-binned graph that plots some aggregated value. Usually this would be a doddle, since there is a single timestamp for each value, making it relatively straightforward to bin.
However, my problem lies in having two timestamps for each value: a start and an end, similar to a Gantt chart. I essentially want to bin the values (average) for the periods when the timelines fall within each bin (bin boundaries could be where a new/old task starts/ends).
I'm looking for a basic example, or an answer to whether this is even supported in Vega-Lite. My current working example would yield no benefit to this discussion.
I see that you found a Vega solution, but I think in Vega-Lite what you were looking for was something like the following: put the start field in "x" and the end field in "x2", then add "bin" and "type" to "x", and all should work.
"encoding": {
"x": {
"field": "start_time",
"bin": { "binned": true },
"type": "temporal",
"title": "Time"
},
"x2": {
"field": "end_time"
}
}
I lost my old account, but I was the person who posted this. Here is my solution to my question. The value I am aggregating here is the total time each datapoint's timeline is contained within each bin.
First you want to use a join aggregate to get the max and min times your data extend to. You could also hardcode this.
{
  "type": "joinaggregate",
  "fields": ["startTime", "endTime"],
  "ops": ["min", "max"],
  "as": ["min", "max"]
}
You want to find a step size for your bins; you can hard-code this or compute it with a formula and write it into a new field.
You then want to create two new fields in your data: one a sequence between the min and max, and the other the same sequence offset by your step.
{
  "type": "formula",
  "expr": "sequence(datum.min, datum.max, datum.step)",
  "as": "startBin"
},
{
  "type": "formula",
  "expr": "sequence(datum.min + datum.step, datum.max + datum.step, datum.step)",
  "as": "endBin"
}
The new fields will be arrays. So if we go ahead and use a flatten transform we will get a row for each data value in each bin.
{
  "type": "flatten",
  "fields": ["startBin", "endBin"]
}
You then want to calculate the total time your data spans within each specific bin. To do this, clamp the start time up to the bin start and the end time down to the bin end, then take the difference between the two.
{
  "type": "formula",
  "expr": "if(datum.startTime < datum.startBin, datum.startBin, if(datum.startTime > datum.endBin, datum.endBin, datum.startTime))",
  "as": "startBinTime"
},
{
  "type": "formula",
  "expr": "if(datum.endTime < datum.startBin, datum.startBin, if(datum.endTime > datum.endBin, datum.endBin, datum.endTime))",
  "as": "endBinTime"
},
{
  "type": "formula",
  "expr": "datum.endBinTime - datum.startBinTime",
  "as": "timeInBin"
}
Finally, you just need to aggregate the data by the bins and sum up these times. Then your data is ready to be plotted.
{
  "type": "aggregate",
  "groupby": ["startBin", "endBin"],
  "fields": ["timeInBin"],
  "ops": ["sum"],
  "as": ["timeInBin"]
}
Although this solution is long, it is relatively easy to implement in the transform section of your data. From my experience it runs fast and just shows how versatile Vega can be. Freedom to visualisations!
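The clamp-and-subtract logic in the three formula transforms above can be sanity-checked outside Vega. Here is a plain-Python sketch; the tasks and bins below are made-up intervals, not data from the original post:

```python
# Clamp a time into a bin's range, mirroring the nested if() formulas above.
def clamp(t, lo, hi):
    return min(max(t, lo), hi)

# Overlap length of a task's [start, end] interval with one bin.
def time_in_bin(start, end, bin_start, bin_end):
    return clamp(end, bin_start, bin_end) - clamp(start, bin_start, bin_end)

tasks = [(0, 10), (5, 20)]   # (startTime, endTime)
bins = [(0, 10), (10, 20)]   # (startBin, endBin) after flatten

# Final aggregate: sum of overlap time per bin.
totals = {b: sum(time_in_bin(s, e, *b) for s, e in tasks) for b in bins}
# totals == {(0, 10): 15, (10, 20): 10}
```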

Elastic Ingest Pipeline split field and create a nested field

Dear friendly helpers,
I have an index that is fed by a database via Kafka. Now this database holds a field that aggregates a couple of pieces of information like so: key/value; key/value; (don't ask for the reason, I have no idea who designed it like that and why ;-) )
93/4; 34/12;
it can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally have a "nested" field which holds all values that are in that field.
Probably like this:
{
  "categories": {
    "93": 7,
    "82": 4
  }
}
The use case is the following: we want to visualize the sum of a filtered number of these categories (they tell me how many minutes a specific process took longer) and relate them in ranges.
Example: I filter categories x, y, z and then group how many documents for the day had no delay, which had a delay of up to 5 minutes, and which had a delay between 5 and 15 minutes.
I have tried to get the fields neatly separated with the kv processor and wanted to work from there on but it was a complete wrong approach I guess.
"kv": {
  "field": "IncomingField",
  "field_split": ";",
  "value_split": "/",
  "target_field": "delays",
  "ignore_missing": true,
  "trim_key": "\\s",
  "trim_value": "\\s",
  "ignore_failure": true
}
When I test the pipeline it seems ok
"delays": {
"62": "3",
"86": "2"
}
but there are two things that don't work.
1. I can't know upfront how many of these combinations I have, so converting the values from string to int in the same pipeline is an issue.
2. When I want to create a Kibana index pattern, I end up with many fields like delays.82 and delays.82.keyword, which makes no sense at all for the use case, as I can't filter (get only the sum of delays where the key is one of x, y, z) and aggregate.
I have looked into other processors (dot_expander) but can't really get my head around how to get this working.
I hope my question is clear (I lack English skills, sorry) and that someone can point me in the right direction.
Thank you very much!
You should rather structure them as an array of objects with shared accessors, for instance:
[ {key: 93, value: 7}, ...]
That way, you'll be able to aggregate on categories.key and categories.value.
So this means iterating the categories' entrySet() using a custom script processor like so:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "extracts k/v pairs",
"processors": [
{
"script": {
"source": """
def categories = ctx.categories;
def kv_pairs = new ArrayList();
for (def pair : categories.entrySet()) {
def k = pair.getKey();
def v = pair.getValue();
kv_pairs.add(["key": k, "value": v]);
}
ctx.categories = kv_pairs;
"""
}
}
]
},
"docs": [
{
"_source": {
"categories": {
"82": 4,
"93": 7
}
}
}
]
}
P.S.: Do make sure your categories field is mapped as nested, because otherwise you'll lose the connections between the keys and the values (also called flattening).
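For what it's worth, the whole parse (split on ';', split on '/', convert values to int, emit key/value objects) can be sketched in plain Python. This mirrors the question's sample input and is only an illustration, not an ingest processor:

```python
# Parse "93/4; 34/12;" into a list of {"key": ..., "value": ...} objects,
# mirroring what the kv processor plus the script processor produce together.
def parse_categories(raw):
    pairs = []
    for chunk in raw.split(";"):
        chunk = chunk.strip()
        if not chunk:          # the field can be empty or end with ";"
            continue
        k, v = chunk.split("/")
        pairs.append({"key": k.strip(), "value": int(v.strip())})
    return pairs

parse_categories("93/4; 34/12;")
# -> [{'key': '93', 'value': 4}, {'key': '34', 'value': 12}]
```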

Repeated values in Elasticsearch array and query scoring

I have two documents with a field country which can contain repeated values, e.g.
Doc1:
country: [US, US, GB, US]
Doc2:
country: [US, GB]
I need a query that when looking for country:US will assign a higher score to Doc1 than Doc2 since US appears multiple times in the country field of Doc1, while it will assign the same score to the two documents when looking for country:GB as it appears the same number of times in both documents. Is this something achievable with Elasticsearch?
If you are doing a simple match search on US
GET countryindex/_search
{
  "query": {
    "match": {
      "country": "US"
    }
  }
}
It gives a higher score to a higher term frequency, so [US, US, GB, US] will score higher than [US, GB].
If you search for "GB", [US, GB] will score higher than [US, US, GB, US], since a shorter field length gets a higher score.
If you want the same score when the number of matches is the same, you need to set "norms": false in your mapping.
{
  "properties": {
    "country": {
      "type": "text",
      "norms": false
    }
  }
}
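To see why this behaves as described, here is a toy scoring model in Python. It only illustrates the term-frequency and field-length-norm intuition; it is not Lucene's actual BM25 formula:

```python
# Toy relevance score: term frequency, optionally penalized by field length.
# This is an illustration only, not how Lucene/BM25 actually computes scores.
def toy_score(field, term, use_norms=True):
    tf = field.count(term)                     # more occurrences -> higher score
    norm = 1 / len(field) if use_norms else 1  # longer field -> lower score
    return tf * norm

doc1 = ["US", "US", "GB", "US"]
doc2 = ["US", "GB"]

toy_score(doc1, "US")   # 0.75 -- beats doc2's 0.5, as in the match query
toy_score(doc2, "GB")   # 0.5  -- beats doc1's 0.25 because the field is shorter

# With norms disabled, equal match counts give equal scores:
toy_score(doc1, "GB", use_norms=False) == toy_score(doc2, "GB", use_norms=False)
```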

How to do arithmetic in buckets with nested data in Elasticsearch

I have a concept I want to write a query for in Elasticsearch, but I can't figure out how from the documentation.
Suppose I have documents in an index that look like this, with nested "owners". (I'm omitting the quotation marks to ease my typing)
[
{
id: 1,
cost: 8.50,
owners: [
{ ownerId: 11, share: 0.45 },
{ ownerId: 12, share: 0.55 }
]
},
{
id: 2,
cost: 12.00,
owners: [
{ ownerId: 11, share: 1.0 }
]
},
...
]
I'd like an aggregation that will multiply cost by the owner's share and bucket it by owner ID, then sum them up. So if I only had those two documents shown above, I'd get an aggregation with a bucket for ownerId 11 with a cost sum of 15.825 and one for ownerId 12 with a cost sum of 4.675. It seems like I can do this, but how?
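To make the target numbers concrete, the desired aggregation can be expressed in plain Python (a sketch of the expected result, not an Elasticsearch aggregation):

```python
from collections import defaultdict

docs = [
    {"id": 1, "cost": 8.50, "owners": [{"ownerId": 11, "share": 0.45},
                                       {"ownerId": 12, "share": 0.55}]},
    {"id": 2, "cost": 12.00, "owners": [{"ownerId": 11, "share": 1.0}]},
]

# Bucket by owner ID, summing cost * share per nested owner entry.
cost_by_owner = defaultdict(float)
for doc in docs:
    for owner in doc["owners"]:
        cost_by_owner[owner["ownerId"]] += doc["cost"] * owner["share"]

# cost_by_owner[11] -> 15.825, cost_by_owner[12] -> 4.675
```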
