Clickhouse downsample into OHLC different time bar intervals

How would you handle all kinds of time frames, especially time frames where there are gaps in time over 24 hours, like accounting for close/swap on 24-hour currency markets, commodities, or indexes?
The existing question "clickhouse downsample into OHLC time bar intervals" on Stack Overflow shows how to do 1-minute bars, but does not show how to do different time frames or how to deal with swap/close time during a 24-hour session.
How would one go about using time in Go with ClickHouse, so that a time.Time value from Go could be used in a ClickHouse date WHERE clause?
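As a rough sketch (not from either Stack Overflow thread), assuming the clickhouse-go/v2 driver, a hypothetical ticks(symbol String, ts DateTime, price Float64) table, and that the driver accepts positional ? bind parameters, the arbitrary-interval bucketing and the time.Time binding could look like this:

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go/v2" // registers the "clickhouse" database/sql driver
)

func main() {
	db, err := sql.Open("clickhouse", "clickhouse://localhost:9000/default")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// time.Time values are handed straight to the driver, which binds them
	// against the DateTime column in the WHERE clause.
	to := time.Now().UTC()
	from := to.Add(-24 * time.Hour)

	// toStartOfInterval buckets ticks into an arbitrary bar size (5 minutes here);
	// argMin/argMax pick the first and last price inside each bucket.
	rows, err := db.Query(`
        SELECT
            toStartOfInterval(ts, INTERVAL 5 minute) AS bar,
            argMin(price, ts) AS open,
            max(price)        AS high,
            min(price)        AS low,
            argMax(price, ts) AS close
        FROM ticks
        WHERE symbol = ? AND ts >= ? AND ts < ?
        GROUP BY bar
        ORDER BY bar`, "EURUSD", from, to)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var bar time.Time
		var open, high, low, closePx float64
		if err := rows.Scan(&bar, &open, &high, &low, &closePx); err != nil {
			log.Fatal(err)
		}
		fmt.Println(bar, open, high, low, closePx)
	}
}

To change the bar size, change the INTERVAL (minute, hour, day, and so on). Buckets with no trades simply never appear in the result set, so the daily close/swap gap on a 24-hour market shows up as missing bars rather than zero-priced ones; if the swap window has to be excluded explicitly, its cut-off times can be added to the WHERE clause.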

Related

Apache Spark: How to detect data skew using Spark web UI

Data skew is something that happens often and should be detected and treated correctly. I'm able to detect data skew in a specific table using a groupBy/count query on the joining key; however, I have multiple joins in my application and doing that for each join can take time.
So is it possible to detect data skew directly in the Spark web UI, which would save me time?
Data skew means that some partitions are significantly bigger than others.
I usually check two things. In the Stages tab, sort by decreasing duration, then drill into the slow ones:
1- Check Summary Metrics, which is one of the most important parts of the Spark UI. It gives you information about how your data is distributed among your partitions.
To detect skew, compare the durations in the Median and Max columns. Ideally the two values should be about the same; when the difference between them is large, there is almost certainly data skew. For example, a stage where the Max task duration is 31 minutes while the Median is only 1.1 minutes means some tasks are taking far too long because of partition size imbalance; a very low Min duration also indicates that some partitions are nearly empty.
2- At the bottom of the stage page you can find all the tasks belonging to that stage. Sort them by decreasing duration, then by increasing duration, and make sure the min and max durations are close; if they are not, your partitions are skewed.
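The same median-vs-max check can be scripted once you export the task durations (for example from Spark's monitoring REST API). A tiny illustrative sketch in Go, with made-up durations, of the heuristic described above:

package main

import (
	"fmt"
	"sort"
	"time"
)

// flagSkew applies the median-vs-max heuristic from the answer above: if the
// slowest task takes far longer than the median task, the partitions feeding
// that stage are probably imbalanced.
func flagSkew(durations []time.Duration, factor float64) bool {
	if len(durations) == 0 {
		return false
	}
	sorted := append([]time.Duration(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	median := sorted[len(sorted)/2]
	maxDur := sorted[len(sorted)-1]
	return float64(maxDur) > factor*float64(median)
}

func main() {
	// Hypothetical task durations, as read off the Summary Metrics table.
	tasks := []time.Duration{
		65 * time.Second, 66 * time.Second, 70 * time.Second, 31 * time.Minute,
	}
	fmt.Println("skewed:", flagSkew(tasks, 5)) // true: 31 min vs a ~70 s median
}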

Get the last value of a metric in a Datadog dashboard

I'm looking to display in my Datadog dashboard the last value of a metric in a QueryValue widget.
For the moment, I'm using:
"queries": [
{
"query": "max:blabla.mycount{$env}",
"data_source": "metrics",
"name": "query1",
"aggregator": "last"
}
]
Is this the right way to do it? For this series of mycount values [20, 1, 5, 3, 2], which number will be taken? Is it really the last one of the series (2) or the biggest one in the series (20)?
Regards,
Blured.
So there are going to be three levels of aggregation to consider: the time aggregation and space aggregation of your query, and then the aggregation of the query value widget on the frontend (which is what you're asking about). For now, let's understand time aggregation by thinking of a timeseries widget, and then we'll see what happens with the query value widget after.
Space aggregation is the simplest one. The idea is that you have multiple time series being submitted from multiple applications/servers. If 20 computers all send a metric at the same time, which one should we pick to display? You decide that with the aggregation chunk of your query; yours is currently set to max.
The idea is that you have to decide which out of the dozens or hundreds of instances of your metric is the one you want to display.
If you don't want to worry about space aggregation, you have to make your query specific enough that only one time series exists for that metric. For example, a CPU metric will need to be scoped to at least the hostname. For a container metric, hostname isn't enough; you would need at least the container_id. For a database there should be a db_identifier or something that gets you just one result back.
Now for time aggregation, let's look at the docs a bit:
As Datadog stores data at a 1 second granularity, it cannot display all real data on graphs. See How data is aggregated in graphs for more details.
For a graph on a 1-week time window, it would require sending hundreds of thousands of values to your browser—and besides, not all these points could be graphed on a widget occupying a small portion of your screen.
...
The Datadog backend tries to keep the number of intervals to a number below ~300.
https://docs.datadoghq.com/dashboards/guide/query-to-the-graph/#proceed-to-time-aggregation
So for example, if you are looking at a 5-minute window, the time aggregation will be as granular as possible. There are 300 seconds in 5 minutes, so every interval on the graph will represent 1 second. If we zoom out to 10 minutes (600 seconds), we can only show data every 2 seconds, so each bucket will represent 2 data points (assuming the metric is submitted every second).
In most scenarios your metrics are being submitted at a 15-second interval, so you won't notice any time aggregation rollups until 15*300 = 4500 seconds (a bit over an hour).
You control this with the rollup function, as described in the docs. If you don't want to worry about time aggregation, just make sure your time range is zoomed in enough to not have any bucketing.
And now for the last level of aggregation, the query value widget. You have now obtained a set of up to 300 points from the backend; space and time aggregation have already been applied. Out of those 300 datapoints, which one do you want to display? You could choose the last point, a sum of the points, or whatever.
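To make the three levels concrete, here is a toy sketch (in Go, not Datadog's actual implementation) that assumes space aggregation has already reduced things to a single series: the window is rolled up into at most 300 buckets, and then the widget-level aggregator picks one number out of the rolled-up series. With your [20, 1, 5, 3, 2] example and aggregator: last, the widget shows 2, not 20:

package main

import (
	"fmt"
	"sort"
	"time"
)

// Point is one raw datapoint of a single, already space-aggregated series.
type Point struct {
	T time.Time
	V float64
}

// rollup mimics the time-aggregation step: the query window is split into at
// most maxIntervals buckets and each bucket is reduced to one value (an
// average here; the real rollup method depends on the metric type and query).
func rollup(points []Point, maxIntervals int, window time.Duration) []float64 {
	bucket := window / time.Duration(maxIntervals) // e.g. 10 min / 300 = 2 s buckets
	type agg struct {
		sum float64
		n   int
	}
	buckets := map[int64]*agg{}
	for _, p := range points {
		k := p.T.Truncate(bucket).Unix()
		if buckets[k] == nil {
			buckets[k] = &agg{}
		}
		buckets[k].sum += p.V
		buckets[k].n++
	}
	keys := make([]int64, 0, len(buckets))
	for k := range buckets {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	out := make([]float64, 0, len(keys))
	for _, k := range keys {
		out = append(out, buckets[k].sum/float64(buckets[k].n))
	}
	return out
}

// last is the widget-level aggregator from the question: out of the
// rolled-up series, keep only the final value.
func last(series []float64) float64 {
	return series[len(series)-1]
}

func main() {
	base := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	series := []Point{
		{base, 20}, {base.Add(2 * time.Second), 1}, {base.Add(4 * time.Second), 5},
		{base.Add(6 * time.Second), 3}, {base.Add(8 * time.Second), 2},
	}
	rolled := rollup(series, 300, 10*time.Minute)
	fmt.Println(last(rolled)) // 2: the most recent value, not the max (20)
}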
Hopefully that helps!

Elasticsearch - Search within near realtime (1 sec)

I came across the following phrase at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/documents-indices.html:
When a document is stored, it is indexed and fully searchable in near real-time—within 1 second.
Assuming the 1 second is subjective and depends on various factors, can we safely assume it is at least 1 second? Also, I see different time intervals that kick in as part of indexing, like the refresh interval, etc. Is this 1 second approximately the sum of all those intermediate intervals?
How realtime is it, when we say Elasticsearch is a (near) realtime search engine?
The default refresh interval (controlled by the index setting index.refresh_interval) is one second. The sentence you cite means exactly that. By default, a document you index will be available for search within at most one second, but it can be less than that.
If a refresh happens at instant T and you index a document at that same moment, then the underlying segments will be refreshed in pretty much exactly one second and your document will be searchable after that refresh.
If a refresh happens at instant T, and you index your document 500ms after that instant, then it will be available for search just 500ms after being indexed.
That also means your document could be available just a few milliseconds (say 10ms) after being indexed if you index it at instant T+990ms after the last refresh that happened at instant T.
It's not an exact science, so that one second should be taken with a grain of salt; sometimes it could last a tad longer, say 10xx ms, where xx depends on various factors. You should not rely on that duration being nano-exact, though.
So near-real time simply means the duration of that refresh interval (which you can modify).
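If you ever need to verify or change that behaviour, the refresh interval is just an index setting, and a refresh can also be forced by hand. A minimal sketch using plain HTTP (the index name and host below are placeholders):

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	es := "http://localhost:9200" // placeholder cluster address

	// Lower (or raise) the refresh interval on an index; "1s" is the default
	// that gives the "searchable within about one second" behaviour.
	body := bytes.NewBufferString(`{"index": {"refresh_interval": "1s"}}`)
	req, err := http.NewRequest(http.MethodPut, es+"/my-index/_settings", body)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// Or force an immediate refresh after indexing, if one document really has
	// to be searchable right away (expensive if done on every write).
	resp, err = http.Post(es+"/my-index/_refresh", "application/json", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("refresh:", resp.Status)
}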

Best data structure to find price difference between two time frames

I am working on a project where my task is to find the % price change given 2 time frames for 100+ stocks in an efficient way.
The time frames are predefined and can only be 5 minutes, 10 minutes, 30 minutes, 1 hour, 4 hours, 12 hours, or 24 hours.
Given a time frame, I need a way to efficiently figure out the % price change of all the stocks that I am tracking.
As per the current implementation, I am getting price data for those stocks every second and dumping the data to a price table.
I have another cron job which updates the % change of the stock based on the values in the price table every few seconds.
The solution is kind of working, but it's not efficient. Is there any data structure/algorithm that I can use to find the % change efficiently?
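The question is left open above, but since the lookbacks are fixed and prices arrive once per second, one common approach (not from the original post) is a per-stock ring buffer holding the last 24 hours of prices, so any predefined window becomes a single index lookup instead of a table query. A hypothetical sketch:

package main

import (
	"fmt"
	"time"
)

// PriceRing keeps the last 24 hours of per-second prices for one stock in a
// fixed-size circular buffer (~0.7 MB per stock), so the lookback price for
// any predefined window is a single index computation instead of a table scan.
type PriceRing struct {
	prices [24 * 3600]float64
	head   int // index of the most recent sample
}

// Push records the latest 1-second price sample.
func (r *PriceRing) Push(p float64) {
	r.head = (r.head + 1) % len(r.prices)
	r.prices[r.head] = p
}

// Change returns the % change over the given lookback (5m, 1h, 24h, ...).
func (r *PriceRing) Change(lookback time.Duration) float64 {
	offset := int(lookback.Seconds())
	idx := (r.head - offset + len(r.prices)) % len(r.prices)
	old := r.prices[idx]
	if old == 0 {
		return 0 // not enough history collected yet
	}
	return (r.prices[r.head] - old) / old * 100
}

func main() {
	var r PriceRing
	// Simulate 10 minutes of 1-second ticks drifting upwards.
	for i := 0; i <= 600; i++ {
		r.Push(100 + float64(i)*0.01)
	}
	fmt.Printf("5m change: %.2f%%\n", r.Change(5*time.Minute)) // about 2.91%
}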

How to design a system in which we can query top results in last n hours

I was asked this question in an interview. The details were: assume we are getting millions of events. Each event has a timestamp and other details. The system design requires the ability for the end user to query the most frequent records in the last 10 minutes, 9 hours, or maybe 3 months.
An event can be seen as follows:
event_type: {CRUD + Search}
event_info: xxx
timestamp : ts...
The easiest way to figure this out is to look at how stream processing or map-reduce libraries do it (and I have a feeling your interviewers have seen these libraries). It's basically real-time map-reduce (you can look up how that works as well).
I will outline two techniques for event processing. In reality most companies need to do both.
New school Stream processing (real time)
Let's assume for now they don't want the actual events but, more likely, aggregates (I think that was the intent of your question).
An example stream processing project is pipelinedb (they explain how it works at the bottom of their home page).
Events go into a queue/ring buffer.
A worker process reads those events in batches and rolls them up into partial buckets or windows.
Finally there is a combiner or reducer which takes the micro-batches and actually does the updating. An example would be event counts. Because we are using a queue, events come in ordered, and depending on the queue we might be able to have multiple consumers doing the combining operation.
So if you want minute counts, you would do rollups per minute and only store the sum of the events for that minute. This turns out to be fairly small space-wise, so you can store it in memory.
If you wanted those counts for a day, a month, or even a year, you would just add up all the minute-count buckets.
Now there is of course a major problem with this technique. You need to know what aggregates and pivots you would like to collect a priori.
But you get extremely fast lookup of results.
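A minimal sketch of that minute-bucket roll-up (in Go, with hypothetical names): counts are kept per minute, and a larger window such as the last 10 minutes or 9 hours is answered by summing the buckets it covers.

package main

import (
	"fmt"
	"time"
)

// Counter keeps one count per minute bucket; an hourly or daily total is
// just the sum of the buckets inside that window.
type Counter struct {
	buckets map[time.Time]int64
}

func NewCounter() *Counter {
	return &Counter{buckets: make(map[time.Time]int64)}
}

// Record rolls an event up into its minute bucket.
func (c *Counter) Record(ts time.Time) {
	c.buckets[ts.Truncate(time.Minute)]++
}

// CountSince sums every minute bucket that falls inside [from, now].
func (c *Counter) CountSince(from, now time.Time) int64 {
	var total int64
	for bucket, n := range c.buckets {
		if !bucket.Before(from.Truncate(time.Minute)) && !bucket.After(now) {
			total += n
		}
	}
	return total
}

func main() {
	c := NewCounter()
	now := time.Now()
	// Pretend three events arrived over the last two minutes.
	c.Record(now.Add(-90 * time.Second))
	c.Record(now.Add(-30 * time.Second))
	c.Record(now)
	fmt.Println("last 10 minutes:", c.CountSince(now.Add(-10*time.Minute), now)) // 3
}

In practice you would key this per event_type (or whatever pivot you chose a priori) and prune buckets that fall outside the largest window you support.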
Old school data warehousing (partitioning) and Map Reduce (batch processed)
Now let's assume they do want the actual events for a certain time period. This is expensive, because if you store all the events in one place, lookup and retrieval are difficult. But if you use the fact that time is hierarchical, you can store the events in a tree of tuples.
The reason you would want the actual events is that you are doing ad-hoc querying and are willing to wait for the queries to run.
You need some sort of queue for the stream of events.
A worker reads the queue and partitions the events based on time. For example, you would have a partition for a certain day. This is akin to sharding. Many storage systems have support for this (e.g. Postgres partitions).
When you want a certain number of events over a period you union the partitions.
The partitioning is essentially hierarchical (minutes < hours < days, etc.), which means you can do tree-like operations on it.
There are dedicated ways to store such events, called time-series data, where the partitioning index is automatic and fast. These stores are called TSDBs, which you can Google for more info.
An example TSDB product would be influxdb.
Going back to the fact that time (or at least how humans represent it) is organized tree-like, we can perform parallel operations. This is because a tree is a DAG (directed acyclic graph). With a DAG you can do some analysis and basically operate recursively on the branches (also known as fork/join).
An example generic parallel storage product would be citusdb.
Of course this method has a massive drawback: it is expensive! Even if you make it fast by increasing the number of nodes, you will have to pay for those nodes (distributed shards). And in theory the performance should scale linearly, but in practice this does not happen (I will save you the details).
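To illustrate the time-partitioning idea without a database, here is a toy in-memory sketch (hypothetical types, not a real shard implementation): events are routed into a per-day partition, and a range query unions only the partitions that overlap the requested window.

package main

import (
	"fmt"
	"time"
)

// Event mirrors the shape given in the question.
type Event struct {
	Type string
	Info string
	TS   time.Time
}

// Store partitions events by calendar day, the same idea as a daily
// Postgres partition or a TSDB shard, just in memory for illustration.
type Store struct {
	partitions map[string][]Event // key: "2006-01-02"
}

func NewStore() *Store {
	return &Store{partitions: make(map[string][]Event)}
}

func (s *Store) Add(e Event) {
	day := e.TS.Format("2006-01-02")
	s.partitions[day] = append(s.partitions[day], e)
}

// Range unions only the daily partitions that overlap [from, to],
// skipping every other shard entirely.
func (s *Store) Range(from, to time.Time) []Event {
	var out []Event
	for d := from.Truncate(24 * time.Hour); !d.After(to); d = d.Add(24 * time.Hour) {
		for _, e := range s.partitions[d.Format("2006-01-02")] {
			if !e.TS.Before(from) && !e.TS.After(to) {
				out = append(out, e)
			}
		}
	}
	return out
}

func main() {
	s := NewStore()
	now := time.Now().UTC()
	s.Add(Event{Type: "CREATE", Info: "xxx", TS: now.Add(-36 * time.Hour)})
	s.Add(Event{Type: "SEARCH", Info: "xxx", TS: now})
	fmt.Println(len(s.Range(now.Add(-24*time.Hour), now))) // 1: only the event inside the window
}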
I think you will need to persist the data to disk, because:
1- the query duration is super vague, and data might be lost due to unforeseen circumstances like the process being killed, machine failure, etc.
2- you can't keep all the events in memory due to memory constraints (millions of events).
I would suggest using MySQL as the data store, with the timestamp as one of the index keys. But two events might have the same timestamp, so make a composite index key of an auto-increment id + timestamp.
Advantages of MySQL:
Super-reliable with replication
Supports all kinds of CRUD operations and queries
On each query you can basically get the range of the timestamps as per your need.
First count the number of events satisfying the query:
select count(*) from `events` where timestamp >= x and timestamp <= y;
If too many events satisfy the query, query them in batches:
select * from `events` where timestamp >= x and timestamp <= y limit 1000 offset 0;
select * from `events` where timestamp >= x and timestamp <= y limit 1000 offset 1000;
and so on, until the offset exceeds the count of events matching the first query.
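For completeness, that count-then-batch loop could look like this from application code; a rough sketch in Go using database/sql, with the column names assumed from the event structure in the question:

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(localhost:3306)/mydb?parseTime=true")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	to := time.Now()
	from := to.Add(-9 * time.Hour)

	// First count how many events match, then page through them in batches.
	var total int
	if err := db.QueryRow(
		"SELECT COUNT(*) FROM events WHERE timestamp >= ? AND timestamp <= ?",
		from, to).Scan(&total); err != nil {
		log.Fatal(err)
	}

	const batch = 1000
	for offset := 0; offset < total; offset += batch {
		rows, err := db.Query(
			"SELECT event_type, event_info, timestamp FROM events "+
				"WHERE timestamp >= ? AND timestamp <= ? LIMIT ? OFFSET ?",
			from, to, batch, offset)
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var eventType, eventInfo string
			var ts time.Time
			if err := rows.Scan(&eventType, &eventInfo, &ts); err != nil {
				log.Fatal(err)
			}
			fmt.Println(eventType, eventInfo, ts)
		}
		rows.Close()
	}
}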
