Binning Data With Two Timestamps - timeline

I'm posting because I have found no content surrounding this topic.
My goal is essentially to produce a time-binned graph that plots some aggregated value. Usually this would be a doddle, since there is a single timestamp for each value, making it relatively straightforward to bin.
However, my problem lies in having two timestamps for each value - a start and an end - similar to a Gantt chart. I essentially want to bin the values (averaged) wherever a timeline overlaps a given bin; bin boundaries could be where a new/old task starts/ends.
I'm looking for a basic example, or an answer to whether this is even supported in Vega-Lite. My current working example would yield no benefit to this discussion.

I see that you found a Vega solution, but I think what you were looking for in Vega-Lite was something like the following: put the start field in "x" and the end field in "x2", add "bin" and "type" to "x", and all should work.
"encoding": {
"x": {
"field": "start_time",
"bin": { "binned": true },
"type": "temporal",
"title": "Time"
},
"x2": {
"field": "end_time"
}
}
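For context, a minimal end-to-end sketch built around that encoding could look like the following (the sample rows, the "value" field, and the bar mark are illustrative assumptions, not taken from the original post). Each record then renders as a bar spanning its start and end timestamps.
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "values": [
      {"start_time": "2023-01-01T00:00:00", "end_time": "2023-01-01T06:00:00", "value": 10},
      {"start_time": "2023-01-01T03:00:00", "end_time": "2023-01-01T09:00:00", "value": 20}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "start_time",
      "bin": { "binned": true },
      "type": "temporal",
      "title": "Time"
    },
    "x2": { "field": "end_time" },
    "y": { "field": "value", "type": "quantitative" }
  }
}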

I lost my old account, but I was the person who posted this question. Here is my solution. The value I am aggregating here is the total time that each datapoint's timeline is contained within each bin.
First, you want to use a joinaggregate transform to get the min and max times your data extends to. You could also hardcode these.
{
  type: joinaggregate
  fields: [
    startTime
    endTime
  ]
  ops: [
    min
    max
  ]
  as: [
    min
    max
  ]
}
Next you want to find a step size for your bins. You can hard-code this, or use a formula transform to calculate it and write it into a new field.
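For example, a formula transform along these lines (dividing the extent into 20 bins is an arbitrary choice for illustration, not part of the original solution):
{
  type: formula
  expr: (datum.max - datum.min) / 20
  as: step
}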
You then want to create two new fields in your data: one a sequence running from the min to the max in steps of your step size, and the other the same sequence offset by one step.
{
  type: formula
  expr: sequence(datum.min, datum.max, datum.step)
  as: startBin
}
{
  type: formula
  expr: sequence(datum.min + datum.step, datum.max + datum.step, datum.step)
  as: endBin
}
The new fields will be arrays. So if we go ahead and use a flatten transform we will get a row for each data value in each bin.
{
  type: flatten
  fields: [
    startBin
    endBin
  ]
}
You then want to calculate the total time each datapoint spans within each specific bin. To do this, clamp the start time and end time to the bin boundaries (a start before the bin becomes the bin start, an end after the bin becomes the bin end), then take the difference between the two.
{
  type: formula
  expr: if(datum.startTime < datum.startBin, datum.startBin, if(datum.startTime > datum.endBin, datum.endBin, datum.startTime))
  as: startBinTime
}
{
  type: formula
  expr: if(datum.endTime < datum.startBin, datum.startBin, if(datum.endTime > datum.endBin, datum.endBin, datum.endTime))
  as: endBinTime
}
{
  type: formula
  expr: datum.endBinTime - datum.startBinTime
  as: timeInBin
}
Finally, you just need to aggregate the data by the bins and sum up these times. Then your data is ready to be plotted.
{
  type: aggregate
  groupby: [
    startBin
    endBin
  ]
  fields: [
    timeInBin
  ]
  ops: [
    sum
  ]
  as: [
    timeInBin
  ]
}
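To actually draw it, a rect mark spanning each bin works; here is a rough sketch in the same style, where the data name "table" and the scales "xscale" (a time scale over startBin/endBin) and "yscale" (a linear scale over timeInBin) are placeholders, not part of the original spec:
{
  type: rect
  from: {
    data: table
  }
  encode: {
    enter: {
      x: {
        scale: xscale
        field: startBin
      }
      x2: {
        scale: xscale
        field: endBin
      }
      y: {
        scale: yscale
        field: timeInBin
      }
      y2: {
        scale: yscale
        value: 0
      }
    }
  }
}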
Although this solution is long, it is relatively easy to implement in the transform section of your data. In my experience it runs fast and just shows how versatile Vega can be. Freedom to visualisations!

Related

Reorder object hierarchy and group by time in JSONata

Although I'm not a total JSONata noob, I'm having a hard time finding an elegant solution to the following desired transformation. The starting point is a set of time-series data in a format like this:
{
  "series1": {
    "data": [
      { "time": "2022-01-01T00:00:00Z", "value": 22 },
      { "time": "2022-01-02T00:00:00Z", "value": 23 }
    ]
  },
  "series2": {
    "data": [
      { "time": "2022-01-01T00:00:00Z", "value": 220 },
      { "time": "2022-01-02T00:00:00Z", "value": 230 }
    ]
  }
}
I need to "flip the hierarchy", and group these datapoints by timestamp, into an array of objects, like follows:
[
  {
    "time": "2022-01-01T00:00:00Z",
    "series1": 22,
    "series2": 220
  },
  {
    "time": "2022-01-02T00:00:00Z",
    "series1": 23,
    "series2": 230
  }
]
I currently have this working with the expression
$each($, function($v, $s) {
  [$v.data.{
    'series': $s,
    'time': $.time,
    'value': $.value
  }]
}).*{
  `time`: {
    `series`: value
  }
}
~> $each(function($v, $t) {
  $merge([
    $v,
    {'time': $t}
  ])
})
(playground link: https://try.jsonata.org/8CaggujJk)
...and...I can't help but feel that there must be a better way!
For reference, my current expression basically does this in three consecutive steps:
1. The first $each() function splits up the original object into an array of datapoints, each with a series name, timestamp, and value.
2. A grouping operator makes time a key and gathers all values for a given timestamp together.
3. A second $each() function transforms the object back into an array of objects where time is a value again, rather than a key, and merges the time key-value alongside the series values.
I've seen some wonderfully elegant solutions to similar problems on here, but am not sure how to approach this in a better way. Any tips appreciated!

Elastic Ingest Pipeline split field and create a nested field

Dear friendly helpers,
I have an index that is fed by a database via Kafka. This database holds a field that aggregates a couple of pieces of information as key/value; key/value; pairs (don't ask for the reason, I have no idea who designed it like that or why ;-) ), e.g.
93/4; 34/12;
It can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally have a "nested" field which holds all values that are in that field.
Probably like this:
{"categories":
{ "93": 7,
"82": 4
}
}
The use case is the following: we want to visualize the sum of a filtered number of these categories (they tell me how many minutes longer a specific process took) and relate them in ranges.
Example: I filter categories x, y, z and then group how many documents for the day had no delay, which had a delay of up to 5 minutes, and which had a delay between 5 and 15 minutes.
I have tried to get the fields neatly separated with the kv processor and wanted to work from there, but I guess it was completely the wrong approach.
"kv": {
"field": "IncomingField",
"field_split": ";",
"value_split": "/",
"target_field": "delays",
"ignore_missing": true,
"trim_key": "\\s",
"trim_value": "\\s",
"ignore_failure": true
}
When I test the pipeline, it seems OK:
"delays": {
"62": "3",
"86": "2"
}
But there are two things that don't work:
1. I can't know upfront how many of these combinations I have, and thus converting the values from string to int in the same pipeline is an issue.
2. When I want to create a Kibana index pattern I end up with many fields like delay.82 and delay.82.keyword, which does not make sense at all for the use case, as I can't filter (get only the sum of delays where the key is one of x, y, z) and aggregate.
I have looked into other processors (dot_expander) but can't really get my head around how to get this working.
I hope my question is clear (my English is lacking, sorry) and that someone can point me in the right direction.
Thank you very much!
You should rather structure them as an array of objects with shared accessors, for instance:
[ {key: 93, value: 7}, ...]
That way, you'll be able to aggregate on categories.key and categories.value.
So this means iterating the categories' entrySet() using a custom script processor like so:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "extracts k/v pairs",
    "processors": [
      {
        "script": {
          "source": """
            def categories = ctx.categories;
            def kv_pairs = new ArrayList();
            for (def pair : categories.entrySet()) {
              def k = pair.getKey();
              def v = pair.getValue();
              kv_pairs.add(["key": k, "value": v]);
            }
            ctx.categories = kv_pairs;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "categories": {
          "82": 4,
          "93": 7
        }
      }
    }
  ]
}
P.S.: Do make sure your categories field is mapped as nested b/c otherwise you'll lose the connections between the keys & the values (also called flattening).
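For reference, a minimal sketch of such a mapping (the index name my-index and the sub-field types are assumptions) could be:
PUT my-index
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "value": { "type": "integer" }
        }
      }
    }
  }
}
With that in place, you can use a nested aggregation that filters on categories.key and sums categories.value.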

Elasticsearch calculate Max with cutoff

It's a strange requirement.
We need to calculate a MAX value in our dataset; however, some of our data are BAD, meaning the MAX value will produce an undesired outcome.
say the values in field "myField" are:
INPUT:
10 30 20 40 1000000
CURRENT OUTPUT:
1000000
DESIRED OUTPUT:
40
{"aggs": {
"aggs": {
"maximum": {
"max": {
"field": "myField"
}
}
}
}
}
I thought of sorting the data, but that'll be really slow as the actual data runs to 100K+ documents.
So my question: is there a way to cut off data in aggs so it ignores the actual MAX and returns the second MAX? Alternatively, to ignore, say, the top 10% and return the max of the rest?
Have you thought of using percentiles to eliminate outliers? Maybe run a percentile aggregation first and then use that as a base for a range filter?
The requirement seems a bit blurry to me, so this is just another try to help; not sure if this is what you are after.
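To make that concrete, a rough two-step sketch (the index name, the 90th percentile, and the filter value are all placeholders): first ask for the percentile,
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "myField_percentiles": {
      "percentiles": {
        "field": "myField",
        "percents": [ 90 ]
      }
    }
  }
}
then take the reported 90th-percentile value (shown below simply as 100) and use it as an upper bound in a second request:
GET my-index/_search
{
  "size": 0,
  "query": {
    "range": {
      "myField": { "lte": 100 }
    }
  },
  "aggs": {
    "maximum": {
      "max": { "field": "myField" }
    }
  }
}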

Order Terms Aggregation by Geo Distance

So I have an issue here...
I'm using the Chewy Ruby gem to communicate with Elasticsearch:
=> #<Chewy::SnippetPagesIndex::Query:0x007f911c6b1610
@_collection=nil,
@_fully_qualified_named_aggs={"chewy::snippetpagesindex"=>{"chewy::snippetpagesindex::snippetpage"=>{}}},
@_indexes=[Chewy::SnippetPagesIndex],
@_named_aggs={},
@_request=nil,
@_response=nil,
@_results=nil,
@_types=[],
@criteria=
#<Chewy::Query::Criteria:0x007f911c6b1458
@aggregations=
{:group_by=>{:terms=>{:field=>"seo_area.suburb.id", :order=>{:_count=>"asc"}}, :aggs=>{:by_top_hit=>{:top_hits=>{:size=>10}}}}},
@facets={},
@fields=[],
@filters=
[{:geo_distance=>{:distance=>"100km", "seo_area.suburb.coordinates"=>"-27.9836052, 153.3977354"}},
{:bool=>
{:must_not=>[{:terms=>{:id=>[1]}}, {:terms=>{"seo_area.suburb.id"=>[5559]}}],
:must=>[{:term=>{:path_category=>"garden-services"}}, {:term=>{:status=>"active"}}, {:exists=>{:field=>"path_area"}}],
:should=>[]}}],
@options=
{:query_mode=>:must,
:filter_mode=>:and,
:post_filter_mode=>:and,
:preload=>
{:scope=>
#<Proc:0x007f911c6b1700#/Users/serviceseeking/Work/serviceseeking/engines/seo/app/concepts/seo/snippet_page/twins/search.rb:45 (lambda)>},
:loaded_objects=>true},
@post_filters=[],
@queries=[],
@request_options={},
@scores=[],
@script_fields={},
@search_options={},
@sort=[{:_geo_distance=>{"seo_area.suburb.coordinates"=>"-27.9836052, 153.3977354", :order=>"asc", :unit=>"km"}}],
@suggest={},
@types=[]>,
@options={}>
I'm using Elasticsearch aggregation so any sorting from the query/search phase will be gone upon accessing the aggregation.
What I've been passing is this...
aggs: {
  by_seo_area_suburb_id: {
    terms: {
      field: "seo_area.suburb.id",
      size: 10,
      order: { by_distance: "desc" }
    },
    aggs: {
      by_top_hit: {
        top_hits: { size: 10 }
      },
      by_distance: {
        geo_distance: {
          field: "seo_area.suburb.coordinates",
          origin: "52.3760, 4.894",
          ranges: [
            { from: 0, to: 1 },
            { from: 1, to: 2 }
          ]
        }
      }
    }
  }
}
I'm getting this error though...
[500] {"error":{"root_cause":[{"type":"aggregation_execution_exception","reason":"Invalid terms aggregation order path [by_distance]. Terms buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"snippet_pages","node":"srrlBssmSEGsqpZnPnOJmA","reason":{"type":"aggregation_execution_exception","reason":"Invalid terms aggregation order path [by_distance]. Terms buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end."}}]},"status":500}
Simply says...
Terms buckets can only be sorted on a sub-aggregator path that is built out of zero or more single-bucket aggregations within the path and a final single-bucket or a metrics aggregation at the path end.
Any ideas?
You have buckets like this:
1-2
2-3
4-5
and so on. These are not single-value buckets with a natural order. That's what the exception is telling you. So you need something to melt them down to single values.
Even if you could order by that, why would you? Everything with a distance between 1 and 2 would have the same value for comparison, and their ordering would be undefined. If it's enough for you to know which results are 0-1, 1-2 and so on, just turn the aggregation order around: first take the distance, then make a sub-aggregation for terms.
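A rough sketch of that inverted structure, reusing the field names and origin from the question (the kilometre ranges are arbitrary):
aggs: {
  by_distance: {
    geo_distance: {
      field: "seo_area.suburb.coordinates",
      origin: "-27.9836052, 153.3977354",
      unit: "km",
      ranges: [
        { from: 0, to: 50 },
        { from: 50, to: 100 }
      ]
    },
    aggs: {
      by_seo_area_suburb_id: {
        terms: { field: "seo_area.suburb.id", size: 10 },
        aggs: {
          by_top_hit: { top_hits: { size: 10 } }
        }
      }
    }
  }
}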
All in all, I think you have a use case in which aggregations are not what you want. Consider the following two documents:
{ name: "peter", location: [0,0] }
{ name: "peter", location: [100,0] }
Obviously both Peters would melt into one bucket in a terms aggregation, but they have two different locations and therefore their distances will (nearly) always differ. So how can you order the Peters by distance? As soon as you aggregate on a field, all other fields more or less become decoupled from it, and you cannot use them for ordering.
So, if you want something like this, you most likely have to go via the normal search. Have a look at this on how to sort a search by distance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/sorting-by-distance.html
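As a minimal sketch of what that sort looks like against the index from the question (the match_all query is just a placeholder):
GET snippet_pages/_search
{
  "query": { "match_all": {} },
  "sort": [
    {
      "_geo_distance": {
        "seo_area.suburb.coordinates": "-27.9836052, 153.3977354",
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}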

ElasticSearch Custom Scoring with Arrays

Could anyone advise me on how to do custom scoring in Elasticsearch when searching for an array of keywords against an array of keywords?
For example, let's say there is an array of keywords in each document, like so:
{ // doc 1
  keywords : [
    red : { weight : 1 },
    green : { weight : 2.0 },
    blue : { weight : 3.0 },
    yellow : { weight : 4.3 }
  ]
},
{ // doc 2
  keywords : [
    red : { weight : 1.9 },
    pink : { weight : 7.2 },
    white : { weight : 3.1 }
  ]
},
...
And I want to get a score for each document based on a search that matches keywords against this array:
{
  keywords : [
    red : { weight : 2.2 },
    blue : { weight : 3.3 }
  ]
}
But instead of just determining whether they match, I want to use a very specific scoring algorithm.
Scoring a single field is easy enough, but I don't know how to manage it with arrays. Any thoughts?
Ah, an interesting question! (And one I think we can solve with some communication.)
Firstly, have you looked at custom script scoring? I'm pretty sure you can do this (slowly) with that. If you were to do this, I would consider a rescore phase so the scoring is only calculated after the doc is known to be a hit.
However, I think you can do this with Elasticsearch machinery. As far as I can work out, you are doing a dot product between docs (where the weights are actually halfway between what you are specifying and 1).
So, my first suggestion: remove the x/2n term from your "custom scoring" (dot product) and put your weights halfway between 1 and the custom weight (e.g. 1.9 => 1.45).
... I'm sorry, I will have to come back and edit this answer. I was thinking about using nested docs with a field-defined boost level, but alas, the _boost mapping parameter is only available for the root doc.
P.S. Just had a thought: you could have fields with defined boost levels and store the terms there; then you can do this easily, but you lose precision. A doc would then look like:
{
"boost_1": ["aquamarine"],
"boost_2": null, //don't need to send this, just showing for clarity
...
"boost_5": ["burgundy", "fuschia"]
...
}
You could then define these boostings in your mapping. One thing to note is that a field's boost value carries over to the _all field, so you would now have a bag of weighted terms in your _all field; you could then construct a bool should query with lots of term queries with different boosts (for the weights of the second doc).
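A rough sketch of what that bool/should query could look like, using the weights from the example search (untested, and the _all usage follows the older mapping behaviour described above):
{
  "query": {
    "bool": {
      "should": [
        { "term": { "_all": { "value": "red", "boost": 2.2 } } },
        { "term": { "_all": { "value": "blue", "boost": 3.3 } } }
      ]
    }
  }
}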
Let me know what you think! A very, very interesting question.
