elasticsearch fill gaps with previous value - elasticsearch

I have time series data in Elasticsearch and I want to aggregate it to create a histogram. What I want to achieve is to fill the empty buckets with the value of the previous data point. I know that I can use min_doc_count: 0, but it will set the value to 0, and I couldn't find any out-of-the-box way to do this in Elasticsearch. Maybe there is some trick that I am not aware of?
Appreciate your feedback.

I think the Date Histogram Aggregation does not provide a native way to perform what you would like.
The closest thing I can think of is using missing value. However, this will set a static value to all the dates where no values are found, which is not exactly what you want.
I also thought of using Painless with the following logic:
1. Get the first value in the histogram and store it in a variable current.
2. If the next value is different from 0, store this value in current.
3. If the value is 0, set that bucket's value to current. Don't change current.
4. Repeat steps 2 and 3 until you reach the end of the histogram.
In my experience, using Painless is really painful, but you can consider it as an alternative.
Additionally, I would recommend limiting ES to searches and aggregations. If you require additional logic on the output, consider performing it outside ES. You can use the Python ES client, for instance.
I can think of the following script with a similar logic as the Painless scenario:
# es is assumed to be an already-configured Elasticsearch client
current = 0
results = es.search(...)
for bucket in results["aggregations"]["my_histogram_name"]["buckets"]:
    if not bucket["doc_count"]:  # same as bucket["doc_count"] == 0
        bucket["doc_count"] = current  # fill the gap with the last seen value
    current = bucket["doc_count"]  # changed or not, remember the latest value
After that, the histogram should look the way you want and be ready to be displayed.
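As a self-contained variant of the loop above, here is a minimal sketch that forward-fills a plain list of buckets, so it can be tried without a live cluster. The function name fill_gaps and the sample buckets are made up for illustration; the bucket shape (key_as_string, doc_count) is assumed to match a date histogram response obtained with min_doc_count: 0.

```python
def fill_gaps(buckets, field="doc_count", initial=0):
    """Return a copy of buckets where empty ones carry the previous value forward."""
    filled = []
    current = initial  # used until the first non-empty bucket is seen
    for bucket in buckets:
        bucket = dict(bucket)  # copy so the original response is left untouched
        if not bucket[field]:
            bucket[field] = current  # gap: reuse the last seen value
        else:
            current = bucket[field]  # real value: remember it
        filled.append(bucket)
    return filled


# Hypothetical buckets, shaped like a date histogram response
sample = [
    {"key_as_string": "2021-01-01", "doc_count": 5},
    {"key_as_string": "2021-01-02", "doc_count": 0},
    {"key_as_string": "2021-01-03", "doc_count": 0},
    {"key_as_string": "2021-01-04", "doc_count": 3},
]
filled = fill_gaps(sample)
print([b["doc_count"] for b in filled])  # [5, 5, 5, 3]
```

The field argument is there because the same idea should apply to a sub-aggregation value instead of doc_count, as long as an empty bucket is reported as 0 or None.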
Hope this is helpful! :)

Related

Aerospike set expiration date for specific field

I have an Aerospike cache consisting of a list of data whose values have a JSON-like structure.
example value: {"name": "John", "count": 10}
I was wondering if it is possible to set an expiration time for only the count field and reset it after some time.
Aerospike doesn't support such functionality out of the box. You would have to code this (hence your other post, I guess: Best way to update single field in Aerospike). You can add filters to only do this based on metadata of the record (the last update time of the record, accessible through Expressions) or any other logic, and it should be very efficient to then let a background ops query do the work.
Another approach can be adding your own custom expiration time stamp in your bin data like so:
{"name":"John", "count":10, "validTill":1672563600000000000}.
Here, I generated the timestamp as below (you can use a different future timestamp format):
$ date --date="2023-01-01 09:00:00" +%s%N
1672563600000000000
Now when you read the record, read through an expression that returns count = 10 if your current clock is behind validTill, 0 otherwise. This can work if the count value on read is all you care about. Also, when you go to update the count value in a future write, you can use the same expression logic to update both count and validTill.
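The read and write logic described above can be sketched in plain Python. This is not Aerospike client or Expressions code; it only illustrates the comparison an expression would perform. The function names effective_count and updated_bin, and the choice to restart validTill on every write, are assumptions made for this example.

```python
import time


def effective_count(bin_value, now_ns=None):
    """Return count while the clock is behind validTill, otherwise 0."""
    if now_ns is None:
        now_ns = time.time_ns()  # nanoseconds, same unit as `date +%s%N`
    return bin_value["count"] if now_ns < bin_value["validTill"] else 0


def updated_bin(bin_value, increment, ttl_ns, now_ns=None):
    """Return a new bin value with count and validTill updated together."""
    if now_ns is None:
        now_ns = time.time_ns()
    base = effective_count(bin_value, now_ns)  # an expired count restarts from 0
    # Assumption: each write pushes validTill to now + ttl_ns
    return {**bin_value, "count": base + increment, "validTill": now_ns + ttl_ns}


record = {"name": "John", "count": 10, "validTill": 1672563600000000000}
before = effective_count(record, now_ns=1672563599000000000)
after = effective_count(record, now_ns=1672563601000000000)
print(before, after)  # 10 0
```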
If this works for you, you don't have to scan and update records using background jobs.

Visualization in Elasticsearch using customized query

Here's the situation I have, suppose my index document looks like this :
{
  "user": 1,
  "started": "2021-06-05",
  "finished": -1,
  "status": "ONGOING"
}
{
  "user": 2,
  "started": "2021-06-05",
  "finished": "2021-06-06",
  "status": "DONE"
}
I have 100 docs indexed like this. The ongoing documents have -1 as the finished time and the completed ones have a valid timestamp. I want to visualize a graph that gives me the number of ongoing applications, with the "started" field on the X-axis.
In the date histogram, I'm only able to get the filtered ongoing processes for that specific interval. But I want the count for the ongoing application to be counted for every interval until the document is updated with the finish time.
Is there any way I can visualize this in Kibana? Even an Elasticsearch query that can give me this output will do.
This is really similar to a problem I had and have now solved. I spent ages trying to create a query that does this to no avail, but luckily this can be achieved using Vega's transforms.
If you want to bin it evenly, rather than using the start time as your x-values, here is the posted solution (look for my answer). The one thing I would add is that, for the documents where you have "-1" as the finished time, you can use a formula transform to round these to the end bin times.
However, if you still want to stick to the "started"/"finished" field being the points of summation/evaluation this is also possible. I'll give you a quick rundown on how to do this...
Method:
The first thing you need to do is create two copies of your data with a common field referring to the "timestamp". The first dataset will have the "started" value assigned to the field "timestamp" (started dataset) and the second will have "finished" (finished dataset). You can achieve this using the formula transform.
You will then need to create a column in each dataset named "operation" referring to what that data entry does: add a user or remove a user. For the finished dataset you want to assign a column of '-1's and for the started dataset '1's, again using formula transforms.
Then join these datasets back up. You now need to order by "timestamp" and cumulatively sum the "operation" column up. This can be achieved using the window transform.
This should give you the data needed to plot it. Arguably this is much more accurate than binning, but if your data set is large it can yield quite messy results - binning in this case is much cleaner.
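The method above can be sketched outside Vega to check the arithmetic. This is a plain-Python illustration of the same steps (tag each timestamp with +1 or -1, sort, cumulatively sum), not Vega transform syntax. The function name ongoing_over_time is made up, and skipping the "finished" event for documents whose finished value is -1 is an assumption of this sketch.

```python
from itertools import accumulate


def ongoing_over_time(docs):
    """Return (timestamp, ongoing_count) pairs, ordered by timestamp."""
    events = []
    for doc in docs:
        events.append((doc["started"], 1))  # "started" dataset: operation = +1
        if doc["finished"] != -1:  # assumption: ongoing docs emit no -1 event
            events.append((doc["finished"], -1))  # "finished" dataset: operation = -1
    # ISO dates sort chronologically as strings; the sort is stable
    events.sort(key=lambda event: event[0])
    totals = accumulate(operation for _, operation in events)
    return [(timestamp, total) for (timestamp, _), total in zip(events, totals)]


docs = [
    {"user": 1, "started": "2021-06-05", "finished": -1, "status": "ONGOING"},
    {"user": 2, "started": "2021-06-05", "finished": "2021-06-06", "status": "DONE"},
]
series = ongoing_over_time(docs)
print(series)
# [('2021-06-05', 1), ('2021-06-05', 2), ('2021-06-06', 1)]
```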
Good luck. There is obviously a lot to fill in, but a working example would have taken me quite a while to draw up. Plus, where is the fun in copying?

Is it possible to have a constant value in a calculated field in QuickSight?

In QuickSight, when you want to define a constant value to reuse it in visualizations later, you can try to set it as:
Calculated field: goalFor2020
Formula: 20000
But right now it doesn't allow you to put just a number in the formula.
Is there any way to have just a number in the formula of a calculated field?
The reason we need it is just to have a number that doesn't depend on any data, just manually defined by us.
Interesting, QuickSight lets me insert a number into a calculated field, just fine.
Since that isn't working for you, I'd recommend using a parameter with a default value.
Parameters essentially have the same "rights" as a calculated field (they can be used in visuals, other calculated fields, etc.). They can also be passed via query parameters, which may or may not be a feature that you'd find useful.
Another cool benefit of using parameters is that, if you're embedding QuickSight, you could retrieve this value dynamically and pass it to the dashboard. Then if you wanted to, say, generalize it for different yearly goals, the goal could be passed in dynamically (rather than hard-coded in a calculated field).
We could achieve it with a trick, just apply some function that returns a number to one of your columns, and make it 0, then add your constant number:
Calculated field: goalFor2020
Formula: count(email) * 0 + 20000
It does the trick, but there might be a better way to do it.
I have tried something like this:
distinct_countIf({dimension},{dimension}='xxx')*
+distinct_countIf({dimension},{dimension}='xxx')*
This just makes the distinct_countIf condition match, so it returns 1; then multiply that 1 by the number you want to hardcode. If the condition is not met, it returns 0, so the number is not added.

How to sort by a derived value that includes a moving date in ElasticSearch?

I have a requirement to sort the results returned by ElasticSearch by a special value I define; let's call it 'X'.
Now - the problem is, 'X' is a value derived based on:
field A in the document (which is a 'term')
field B (which is a 'date')
the current date (UTC)
So, the problem is obviously 3. The date always changes, so I'm not sure how to include this in the sort, since it's not part of the document.
From my initial reading it appears I can use a 'script' here, but I'm worried about the performance, since I could be searching and sorting over thousands of documents.
The only other idea that came to mind is to calculate the value nightly, and store that in each document. But that has a few drawbacks:
I need to have something running in the background to update this value
there could be a lot of documents to update (60%+ every night)
I lose precision for the value depending on how long it is between script runs (if I run nightly, the value is up to 23 hours 'stale')
Any advice?
Thanks
This can be done by having an ES script run nightly that calculates the value and stores it in each document.
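For comparison, the script-based option mentioned in the question could look roughly like the sketch below, which builds a script sort clause and passes the current time in as a parameter so it is fixed once per request. The helper name build_sort_clause is hypothetical, the field name "B" is the placeholder from the question, and the score (age of field B in days) is only a stand-in, because the actual formula for 'X' and the contribution of field A are not given.

```python
import time


def build_sort_clause(now_ms=None, date_field="B"):
    """Build a script-based sort clause that sorts by the age of date_field in days."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # current UTC time in epoch milliseconds
    return {
        "_script": {
            "type": "number",
            "order": "desc",
            "script": {
                "lang": "painless",
                # Stand-in formula: milliseconds since field B, converted to days.
                # Documents without the field would need extra handling.
                "source": (
                    "(params.now - doc[params.field].value"
                    ".toInstant().toEpochMilli()) / 86400000.0"
                ),
                "params": {"now": now_ms, "field": date_field},
            },
        }
    }


clause = build_sort_clause(now_ms=1622851200000)
print(clause["_script"]["script"]["params"])
```

The returned dictionary would go into the "sort" list of a search request body. Whether this performs acceptably over thousands of documents is something to measure rather than assume.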

SAS: alternatives to First. and Last. variables when data can not be sorted?

Please help me with the following SAS problem. I need to transform my data set from "original" to "new" as shown in the picture. Because the "priority" variable can not be sorted, it seems that first. and last. variables would not work here, no? The goal is to have each sequence of priorities represent one entry in the "new" dataset.
Thank you!
p.s. I did not know how to create a table in this post so I just took a snapshot of the screen.
Seems fairly straightforward to me. Just create a batch ID.
data have_pret;
  set have;
  by subject;
  if first.subject then batchID=0;
  if priority=1 then batchID+1;
run;
Then you can transpose by subject/batchID. This assumes priority always starts at 1 - if it starts at > 1 sometimes, you may want to adjust your logic and keep track of prior value of priority.
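To make the batch ID logic easier to follow, here is the same idea as a plain-Python sketch (not SAS). The function name assign_batch_ids and the sample rows are made up; as in the data step, rows are assumed to be grouped by subject and each sequence is assumed to start at priority 1.

```python
def assign_batch_ids(rows):
    """Add a batchID that restarts per subject and increments whenever priority is 1."""
    result = []
    previous_subject = object()  # sentinel that never equals a real subject
    batch_id = 0
    for row in rows:
        if row["subject"] != previous_subject:  # equivalent of first.subject
            batch_id = 0
            previous_subject = row["subject"]
        if row["priority"] == 1:  # a new sequence of priorities begins
            batch_id += 1
        result.append({**row, "batchID": batch_id})
    return result


# Hypothetical input: two sequences for subject A, one for subject B
rows = [
    {"subject": "A", "priority": 1},
    {"subject": "A", "priority": 2},
    {"subject": "A", "priority": 1},
    {"subject": "A", "priority": 2},
    {"subject": "A", "priority": 3},
    {"subject": "B", "priority": 1},
]
batch_ids = [row["batchID"] for row in assign_batch_ids(rows)]
print(batch_ids)  # [1, 1, 2, 2, 2, 1]
```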
