In my application I receive measurement data from an external library: a dataset of 20 3D points, 30 times per second. I've noticed some jitter in the data and I'm looking for a way to suppress, or better, flatten peaks in the data stream without significantly slowing down the whole system. Unfortunately I have to rely on the last dataset I receive, so I can't record the data (let's say in a queue) and filter out "bad" values afterwards. But I could record data and try fitting an incoming value to the recorded data.
Is there any established and reliable algorithm for this problem?
Thanks in advance
FUX
Data skew is something that happens often and should be detected and treated correctly. I'm able to detect data skew in a specific table using a groupby/count query on the joining key, but I have multiple joins in my application and doing that for each join can take time.
So is it possible to detect data skew directly in the Spark web UI, which would save me time?
Data skew means that you have partitions that are significantly bigger than other partitions.
I usually check two things. In the Stages tab, sort by decreasing duration, then click on the stages that are slow:
1- Check Summary Metrics which is one of the most important parts of the Spark UI. It gives you information about how your data is distributed among your partitions.
To detect skew, compare the Median and Max columns for duration. Ideally the two values should be close; when the difference between them is large, there is definitely data skew, for example in the picture below:
This means some tasks in that stage take far too long (31 min) compared to others that take only 1.1 minutes, because of the partition size imbalance. The Min duration is also low, which indicates that some partitions are nearly empty.
2- At the bottom of the stage page you can find all tasks related to that stage. Sort them by decreasing duration, then by increasing duration, and make sure the min and max durations are close; if not, your partitions are skewed, as in the picture below:
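The Median versus Max comparison can also be written down as a quick check. This is only a sketch in plain Python, not anything Spark provides: the function names and the threshold of 5 are made up for illustration, and the durations are invented to resemble the 1.1 min versus 31 min case described above.

```python
from statistics import median

def skew_ratio(task_durations_s):
    """Return max/median task duration for one stage (1.0 means perfectly even)."""
    med = median(task_durations_s)
    if med <= 0:
        raise ValueError("median duration must be positive")
    return max(task_durations_s) / med

def looks_skewed(task_durations_s, threshold=5.0):
    # threshold is an arbitrary illustrative cut-off, not a Spark setting
    return skew_ratio(task_durations_s) >= threshold

# Invented durations in seconds: most tasks take about 1.1 min,
# one is nearly empty (3 s) and one takes 31 min.
durations = [66, 64, 70, 66, 3, 1860]
print(round(skew_ratio(durations), 1))
print(looks_skewed(durations))
```

A ratio close to 1 means the partitions are balanced; the larger it gets, the more one partition dominates the stage.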
I'm gathering data from load sensors at about 50Hz. I might have 2-10 sensors running at a time. This data is stored locally, but after a period of about a month it needs to be uploaded to the cloud. The data within any one second can vary quite significantly and is quite dynamic.
It's too much data to send because it's going over GSM and the signal will not always be great.
The most simplistic approach I can think of is to look at the 50 data points in 1 sec and reduce it to just enough data to make a box and whisker graph. Then, the data stored in the cloud could be used to create dashboards that look similar to how you look at stocks. This would at least show me the max, min, average and give some idea around the distribution of the load during that second.
This is probably oversimplified though, so I was wondering if there is a common approach to this problem in data science: take a dense set of data and reduce it so that it still captures the highlights and does not lose its meaning.
Any help or ideas appreciated
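For reference, the box-and-whisker reduction described above could look roughly like this. It is a minimal sketch using only the Python standard library; the function name summarize_second and the fake readings are assumptions for illustration, and the quartile method ("inclusive") is one of several reasonable choices.

```python
from statistics import mean, quantiles

def summarize_second(samples):
    """Reduce one second of readings to the values a box plot needs."""
    q1, q2, q3 = quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(samples),
        "mean": mean(samples),
    }

# 50 fake samples standing in for one second of one sensor at 50 Hz
readings = [float(i) for i in range(1, 51)]
summary = summarize_second(readings)
print(summary)
```

This turns 50 values per sensor per second into 6, at the cost of losing the ordering of the samples within that second.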
I was asked this question in an interview. The details were: assume we are getting millions of events. Each event has a timestamp and other details. The system design must allow the end user to query the most frequent records in the last 10 minutes, 9 hours, or maybe 3 months.
An event looks like the following:
event_type: {CRUD + Search}
event_info: xxx
timestamp : ts...
The easiest way to figure this out is to look at how other stream processing or map reduce libraries do it (and I have a feeling your interviewers have seen these libraries). It's basically real-time map reduce (you can look up how that works as well).
I will outline two techniques for event processing. In reality most companies need to do both.
New school Stream processing (real time)
Let's assume for now that they don't want the actual events but, as is more likely, aggregates (I think that was the intent of your question).
An example stream processing project is pipelinedb (they have how it works on the bottom of their home page).
Events go into a queue/ring buffer.
A worker process reads those events in batches and rolls them up into partial buckets or windows.
Finally there is a combiner or reducer which takes the micro batches and actually does the updating. An example would be event counts. Because we are using the queue from above, events come in ordered, and depending on the queue we might be able to have multiple consumers that do the combining operation.
So if you want minute counts you would do rollups per minute and only store the sum of the events for that minute. This turns out to be fairly small space wise so you can store this in memory.
If you wanted those counts for month or day or even year you would just add up all the minute count buckets.
Now there is of course a major problem with this technique. You need to know what aggregates and pivots you would like to collect a priori.
But you get extremely fast look up of results.
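A minimal sketch of the per-minute rollup idea, in plain Python rather than any particular stream processing product. The class name MinuteRollup and its methods are invented for illustration; it assumes timestamps are Unix seconds and that the only aggregate wanted is a count per event_type.

```python
from collections import Counter, defaultdict

class MinuteRollup:
    """Keeps one small Counter per minute instead of the raw events."""

    def __init__(self):
        # minute start (Unix seconds) -> Counter of event_type
        self.buckets = defaultdict(Counter)

    def add(self, event_type, ts):
        minute = int(ts) // 60 * 60  # truncate the timestamp to its minute
        self.buckets[minute][event_type] += 1

    def top(self, start_ts, end_ts, n=1):
        """Most frequent event types among buckets starting in [start_ts, end_ts)."""
        total = Counter()
        for minute, counts in self.buckets.items():
            if start_ts <= minute < end_ts:
                total.update(counts)  # longer windows are just sums of minute buckets
        return total.most_common(n)

rollup = MinuteRollup()
for event_type, ts in [("search", 5), ("search", 65), ("create", 70), ("search", 700)]:
    rollup.add(event_type, ts)
print(rollup.top(0, 120))
```

Only the aggregate chosen up front (here, counts by event_type) can be queried afterwards, which is the limitation mentioned above.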
Old school data warehousing (partitioning) and Map Reduce (batch processed)
Now let's assume they do want the actual events for a certain time period. This is expensive, because if you store all the events in one place, lookup and retrieval are difficult. But if you use the fact that time is hierarchical, you can store the events in a tree of tuples.
You would want the actual events when you are doing ad hoc querying and are willing to wait for the queries to run.
You need some sort of queue for the stream of events.
A worker reads the queue and partitions the events based on time. For example, you would have a partition for a certain day. This is akin to sharding. Many storage systems have support for this (e.g. Postgres partitions).
When you want the events over a certain period, you union the partitions that cover it.
The partitioning is essentially hierarchical (minutes < hours < days etc.), which means you can do tree-like operations on the partitions.
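The partition-then-union steps above can be sketched in a few lines of Python. This is an in-memory toy, not how any particular database implements partitioning: the names store and query and the event dictionary layout are assumptions, and timestamps are taken to be Unix seconds in UTC.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

partitions = defaultdict(list)  # calendar day (UTC) -> list of events

def day_of(ts):
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()

def store(event):
    """Append the event to the partition for its day."""
    partitions[day_of(event["timestamp"])].append(event)

def query(start_ts, end_ts):
    """Return events with start_ts <= timestamp < end_ts by unioning day partitions."""
    result = []
    day, last = day_of(start_ts), day_of(end_ts)
    while day <= last:
        # Only partitions overlapping the range are touched at all
        result.extend(
            e for e in partitions.get(day, [])
            if start_ts <= e["timestamp"] < end_ts
        )
        day += timedelta(days=1)
    return result

for ts in (10, 86400 + 10, 2 * 86400 + 10):
    store({"event_type": "search", "timestamp": ts})
print(len(query(0, 2 * 86400)))
```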
There are storage systems built for such events (time series data) in which the partitioning index is automatic and fast. These are called TSDBs (time series databases); you can google them for more info.
An example TSDB product would be influxdb.
Now, going back to the fact that time (or at least how humans represent it) is organized like a tree, we can parallelize operations. This is because a tree is a DAG (directed acyclic graph). With a DAG you can do some analysis and basically operate recursively on the branches (also known as fork/join).
An example of a generic parallel storage product would be citusdb.
Now of course this method has a massive drawback. It is expensive! Even if you make it fast by increasing the number of nodes, you will have to pay for those nodes (distributed shards). And in theory the performance should scale linearly, but in practice this does not happen (I will spare you the details).
I think you will need to persist the data to disk, because:
the query duration is very vague, and data might be lost due to unforeseen circumstances such as a killed process or a machine failure;
you can't keep all the events in memory due to memory constraints (millions of events).
I would suggest using MySQL as the data store, with the timestamp as one of the index keys. But two events might have the same timestamp, so make a composite index key of auto-increment id + timestamp.
Advantages of MySQL:
Super-reliable with replication
Supports all kinds of CRUD operations and queries
On each query you can basically get the range of the timestamps as per your need.
First count the no. of events satisfying the query.
select count(*) from `events` where timestamp >= x and timestamp <= y;
If too many events satisfy the query, query them in batches. Use a fixed order so that the batches do not overlap.
select * from `events` where timestamp >= x and timestamp <= y order by timestamp, id limit 1000 offset 0;
select * from `events` where timestamp >= x and timestamp <= y order by timestamp, id limit 1000 offset 1000;
and so on, for as long as the offset is less than the count returned by the first query.
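The count-then-batch loop can be tried end to end with Python's built-in sqlite3 module, used here only as a stand-in for MySQL so the sketch is self-contained. The table layout, the index name idx_events_ts_id and the function fetch_in_batches are assumptions for illustration; the index is put on (timestamp, id) on the assumption that range queries filter on timestamp first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, event_type TEXT, timestamp INTEGER)"
)
conn.execute("CREATE INDEX idx_events_ts_id ON events (timestamp, id)")
conn.executemany(
    "INSERT INTO events (event_type, timestamp) VALUES (?, ?)",
    [("search", t) for t in range(2500)],  # fake data
)

def fetch_in_batches(conn, x, y, batch_size=1000):
    """Yield rows with x <= timestamp <= y in batches of at most batch_size."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM events WHERE timestamp >= ? AND timestamp <= ?",
        (x, y),
    ).fetchone()
    offset = 0
    while offset < count:
        yield conn.execute(
            "SELECT id, event_type, timestamp FROM events "
            "WHERE timestamp >= ? AND timestamp <= ? "
            "ORDER BY timestamp, id LIMIT ? OFFSET ?",  # stable order between batches
            (x, y, batch_size, offset),
        ).fetchall()
        offset += batch_size

batches = list(fetch_in_batches(conn, 0, 2499))
print([len(b) for b in batches])
```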
I am using codahale metrics for monitoring purposes. Let's say there is a spike in latency at some point, and later no values are reported because there is no traffic. The value in the graph then stays as it is (I am using a histogram). At times this gives the impression that the spike persists and that we might need to address it, but it actually means that no values were reported after the spike and hence the graph doesn't decay. Am I missing a config parameter here, or is this behaviour expected?
The way we update the metrics is
metrics.processingTime.update(processingTime);
So, when there is no traffic, we don't update this metric.
I know that the histogram takes into consideration datapoints from the past (for an irregular period of time) in order to display a statistical image of the data.
When there are no new datapoints, only the outlier is taken into consideration and averaged on and on.
The meters have the same behavior, displaying the data through moving averages of 1, 5 and 15 minutes.
The solution in the histogram case is to use HdrHistogram and flush it periodically.
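To illustrate what "flush it periodically" buys you, here is a toy Python sketch. It is not the HdrHistogram or codahale API; the class IntervalHistogram and its snapshot_and_reset method are invented for illustration. The point is only that each report covers one interval, so an old spike cannot linger once the interval containing it has been reported.

```python
class IntervalHistogram:
    """Toy histogram that is emptied every time it is reported."""

    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(value)

    def snapshot_and_reset(self):
        """Return stats for the values seen since the last call, then forget them."""
        values, self._values = sorted(self._values), []
        if not values:
            return None  # no traffic in this interval: report nothing, not the old spike
        p99_index = min(len(values) - 1, int(0.99 * len(values)))
        return {"count": len(values), "max": values[-1], "p99": values[p99_index]}

hist = IntervalHistogram()
hist.update(5)
hist.update(900)  # latency spike
first = hist.snapshot_and_reset()   # interval with the spike
second = hist.snapshot_and_reset()  # quiet interval afterwards
print(first, second)
```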
I am trying to spread out data that is received in bursts. That is, I have data that is received by some other application in large bursts. For each data entry I need to make some additional requests to a server, and I should limit the traffic to that server. Hence I try to spread out the requests over the time I have until the next data burst arrives.
Currently I am using a token bucket to spread out the data. However, because the data I receive is already badly shaped, I am still either filling up the queue of pending requests, or I get spikes whenever a burst comes in. So this algorithm does not seem to do the kind of shaping I need.
What other algorithms are there available to limit the requests? I know I have times of high load and times of low load, so both should be handled well by the application.
I am not sure if I was really able to explain the problem I am currently having. If you need any clarifications, just let me know.
EDIT:
I'll try to clarify the problem some more and explain, why a simple rate limiter does not work.
The problem lies in the bursty nature of the traffic and the fact that bursts have different sizes at different times. What is mostly constant is the delay between bursts. Thus we get a bunch of data records for processing and we need to spread them out as evenly as possible before the next bunch comes in. However, we are not 100% sure when the next bunch will come in, just approximately, so simply dividing the time by the number of records does not work as it should.
Rate limiting does not work, because the spread of the data is not sufficient this way. If we are close to saturation of the rate, everything is fine and we spread out evenly (although this should not happen too frequently). If we are below the threshold, the spreading gets much worse though.
I'll make an example to make this problem more clear:
Let's say we limit our traffic to 10 requests per second and new data comes in about every 10 seconds.
When we get 100 records at the beginning of a time frame, we will query 10 records each second and we have a perfectly even spread. However, if we get only 15 records, we'll have one second where we query 10 records, one second where we query 5 records and 8 seconds where we query 0 records, so we have very unequal levels of traffic over time. Instead it would be better if we just queried 1.5 records each second. However, setting this rate would also cause problems, since new data might arrive earlier, so we would not have the full 10 seconds and 1.5 queries per second would not be enough. If we use a token bucket, the problem actually gets even worse, because token buckets allow bursts to get through at the beginning of the time frame.
However, this example oversimplifies, because actually we cannot fully tell the number of pending requests at any given moment, only an upper limit. So we would have to throttle each time based on this number.
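The numbers in the example can be reproduced with a few lines of Python. The function names are made up for illustration, and the sketch models only a plain per-second cap, not a token bucket.

```python
def capped_schedule(records, cap_per_s, seconds):
    """Requests sent in each second when only a per-second cap is applied."""
    sent = []
    for _ in range(seconds):
        n = min(records, cap_per_s)
        sent.append(n)
        records -= n
    return sent

def even_rate(records, seconds):
    """Rate that would spread the records evenly over the whole period."""
    return records / seconds

print(capped_schedule(100, 10, 10))  # cap saturated: even spread
print(capped_schedule(15, 10, 10))   # below the cap: everything bunches up at the start
print(even_rate(15, 10))             # what we would like instead
```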
This sounds like a problem within the domain of control theory. Specifically, I'm thinking a PID controller might work.
A first crack at the problem might be dividing the number of records by the estimated time until next batch. This would be like a P controller - proportional only. But then you run the risk of overestimating the time, and building up some unsent records. So try adding in an I term - integral - to account for built up error.
I'm not sure you even need a derivative term, if the variation in batch size is random. So try using a PI loop - you might build up some backlog between bursts, but it will be handled by the I term.
If it's unacceptable to have a backlog, then the solution might be more complicated...
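A rough sketch of what such a PI-style pacer could look like in Python. Everything here is an assumption made for illustration: the class name PIPacer, the gains kp and ki, the decay factor, and in particular the choice of error signal (the backlog left over when the next burst arrives, accumulated in a leaky integral). It has not been tuned against real traffic.

```python
class PIPacer:
    """Picks a request rate from the queue length and the estimated time to the next burst."""

    def __init__(self, kp=1.0, ki=0.2, decay=0.5, max_rate=10.0):
        self.kp = kp              # gain on the proportional term
        self.ki = ki              # gain on the accumulated backlog
        self.decay = decay        # leak, so old backlog is gradually forgotten
        self.max_rate = max_rate  # hard limit the server should never exceed
        self.integral = 0.0       # accumulated leftover records

    def on_burst(self, leftover):
        """Call when a new burst arrives; leftover = records still unsent from the last one."""
        self.integral = self.decay * self.integral + leftover

    def rate(self, pending, est_seconds_left):
        """Requests per second to use right now."""
        # P term: rate that would drain the queue exactly when the next burst is expected
        proportional = pending / max(est_seconds_left, 1.0)
        raw = self.kp * proportional + self.ki * self.integral
        return max(0.0, min(self.max_rate, raw))

pacer = PIPacer()
print(pacer.rate(pending=15, est_seconds_left=10))  # pure P term
pacer.on_burst(leftover=5)                          # the last burst was not fully drained
print(pacer.rate(pending=15, est_seconds_left=10))  # I term pushes the rate up
```

With no history the pacer reproduces the "records divided by remaining time" rate; after a burst that left a backlog it runs somewhat faster, and the clamp keeps it under the server limit.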
If there are no other constraints, what you should do is figure out the maximum rate at which you are comfortable sending additional requests, and limit your processing speed to that. Then monitor what happens. If that gets through all of your requests quickly, then there is no harm. If the sustained level of processing is not fast enough, then you need more capacity.