I'm trying to forecast steam consumption one hour ahead using 11 continuous variables, but my predictions seem to be offset from the observed values.
All data cleaning has been done, and during feature engineering I generated 140 variables, including:
Window features using the mean, median, standard deviation, minimum and maximum (with window sizes of 30, 60 and 90 for each variable)
Lags of 30, 60 and 90 for each variable
Extracted day, hour and minute from datetime index
I'm training a LightGBM regressor to do the forecasting task.
The model's MAPE on an unseen dataset was 1.8%, but my problem is the offset between predicted and observed values. I would like to know whether there is any strategy I can use to remove or mitigate this offset.
This picture shows the offset of predictions in unseen data.
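For reference, here is a minimal sketch (not the asker's actual code) of the kind of lag and rolling-window feature construction described above, assuming a pandas DataFrame with a DatetimeIndex; the column names and helper name are placeholders:

import pandas as pd

def add_time_features(df, cols, windows=(30, 60, 90)):
    """Add lag and rolling-window features of the kind described above.
    df is assumed to have a DatetimeIndex; cols are the 11 sensor columns."""
    out = df.copy()
    for col in cols:
        for w in windows:
            out[f"{col}_lag_{w}"] = out[col].shift(w)          # lag of w steps
            roll = out[col].rolling(window=w)                   # trailing window of w rows
            out[f"{col}_mean_{w}"] = roll.mean()
            out[f"{col}_median_{w}"] = roll.median()
            out[f"{col}_std_{w}"] = roll.std()
            out[f"{col}_min_{w}"] = roll.min()
            out[f"{col}_max_{w}"] = roll.max()
    # calendar features from the datetime index
    out["day"] = out.index.day
    out["hour"] = out.index.hour
    out["minute"] = out.index.minute
    return out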
Certain sensors are to trigger a signal based on the rate of change of the value rather than a threshold.
For instance, heat detectors in fire alarms are supposed to trigger an alarm quicker if the rate of temperature rise is higher: A temperature rise of 1K/min should trigger an alarm after 30 minutes, a rise of 5K/min after 5 minutes and a rise of 30K/min after 30 seconds.
I am wondering how this is implemented in embedded systems, where resources are scarce. Is there a clever data structure to minimize the data stored?
The naive approach would be to measure the temperature every 5 seconds or so and keep the data for 30 minutes. On these data one can calculate change rates over arbitrary time windows. But this requires a lot of memory.
I thought about small windows (e.g. 10 seconds) for which min and max are stored, but this would not save much memory.
From a mathematical point of view, the examples you have described can be greatly simplified:
1K/min for 30 mins equals a total change of 30K
5K/min for 5 mins equals a total change of 25K
Obviously there is some adjustment to be made because you have picked round numbers for the example, but it sounds like what you care about is having a single threshold for the total change. This makes sense because taking the integral of a differential results in just a delta.
However, if we disregard the numeric example and just focus on your original question then here are some answers:
First, it has already been mentioned in the comments that one byte every five seconds for half an hour is really not very much memory at all for almost any modern microcontroller, as long as you are able to keep your main RAM turned on between samples, which you usually can.
If, however, you need to discard the contents of RAM between samples to preserve battery life, then a simpler method is just to calculate one differential at a time.
In your example you want a much higher sample rate (every 5 seconds) than the period over which you wish to calculate the delta (e.g. 30 minutes). You can reduce your storage needs to a single data point if you make your sample rate equal to your delta period. The single previous value could be stored in a small battery-retained memory (e.g. the backup registers on an STM32).
Obviously if you choose this approach you will have to compromise between accuracy and latency, but maybe 30 seconds would be a suitable timebase for your temperature alarm example.
You can also set several thresholds in K/sec, and then allocate counters to count how many consecutive times each threshold has been exceeded. This requires only one extra integer per threshold.
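A rough sketch of that counter idea, with made-up thresholds and a 5-second sample period, written in Python for clarity (a real implementation would more likely be fixed-point C on the microcontroller):

# Each (rate, duration) pair is one alarm rule, e.g. 1 K/min sustained for 30 min.
# Each rule is converted to a per-sample rise and a required count of
# consecutive exceedances, and gets exactly one counter.
SAMPLE_PERIOD_S = 5
RULES = [(1.0, 30 * 60), (5.0, 5 * 60), (30.0, 30)]   # (K/min, required duration in s)

counters = [0] * len(RULES)
last_temp = None

def process_sample(temp):
    """Call once per temperature sample; returns True if any rule fires."""
    global last_temp
    if last_temp is None:
        last_temp = temp
        return False
    rise = temp - last_temp                      # K per sample period
    last_temp = temp
    fired = False
    for k, (rate_per_min, duration_s) in enumerate(RULES):
        per_sample_rise = rate_per_min * SAMPLE_PERIOD_S / 60.0
        needed = duration_s // SAMPLE_PERIOD_S   # consecutive samples required
        if rise >= per_sample_rise:
            counters[k] += 1
        else:
            counters[k] = 0                      # reset on any sample below threshold
        if counters[k] >= needed:
            fired = True
    return fired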
In signal processing terms, the procedure you want to perform is:
Apply a low-pass filter to smooth quick variations in the temperature
Take the derivative of its output
The cut-off frequency of the filter would be set according to the time frame. There are 2 ways to do this.
You could apply a FIR (finite impulse response) filter, which is a weighted moving average over the time frame of interest. Naively, this requires a lot of memory, but it's not bad if you do a multi-stage decimation first to reduce your sample rate. It ends up being a little complicated, but you have fine control over the response.
You could apply an IIR (infinite impulse response) filter, which uses feedback of the output. The exponential moving average is the simplest example of this. These filters require far less memory -- only a few samples' worth -- but your control over the precise shape of the response is limited. A classic example like the Butterworth filter would probably be great for your application, though.
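As a hedged illustration of the IIR route, here is a minimal exponential-moving-average (one-pole IIR low-pass) followed by a discrete derivative; the smoothing factor and sample period are placeholder values:

class SmoothedRate:
    def __init__(self, alpha=0.05, dt_s=5.0):
        self.alpha = alpha        # smoothing factor; smaller = lower cut-off frequency
        self.dt_s = dt_s          # sample period in seconds
        self.ema = None
        self.prev_ema = None

    def update(self, sample):
        """Feed one temperature sample; returns the smoothed rate in K/min (or None)."""
        if self.ema is None:
            self.ema = self.prev_ema = sample
            return None
        self.prev_ema = self.ema
        self.ema += self.alpha * (sample - self.ema)           # IIR low-pass step
        return (self.ema - self.prev_ema) / self.dt_s * 60.0   # derivative, per minute

Note that this keeps only two values between samples, which is why the memory footprint is so small.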
Assume I have a huge set of data about a system's idle time.
Day 1 - 5 mins
Day 2 - 3 mins
Day 3 - 7 mins
...
Day 'n' - 'k' mins
We can assume that even though the idle time is random, the pattern repeats.
Using this as training data, is it possible for me to identify the idle-time behavior of the system? And with that, can an abnormality be predicted?
Which algorithm would best suit this purpose?
I tried fitting a regression, but it can only answer "What is the expected idle time today?"
What I want is this: when the idle time deviates from the pattern, that deviation has to be detected.
Edit:
Or does it make sense to predict for the current day only? I.e., today the expected idle time is 'x' minutes; tomorrow it may differ.
I would try a Fourier transform and have a look at whether your system behaves in a periodic way (this would show up as peaks in the frequency domain).
Then get rid of the frequencies with low amplitudes and use the rest to predict the system behavior in the future.
If the real behavior differs a lot from the prediction, that is what you want to detect.
Wikipedia: Fast Fourier transform
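A minimal NumPy sketch of that idea, assuming idle_minutes is a 1-D array of daily idle times; the number of kept frequencies and the tolerance are placeholder values:

import numpy as np

def fft_baseline(idle_minutes, keep=3):
    """Keep only the 'keep' strongest frequencies and rebuild a smooth baseline."""
    spectrum = np.fft.rfft(idle_minutes)
    order = np.argsort(np.abs(spectrum))[::-1]
    filtered = np.zeros_like(spectrum)
    filtered[order[:keep]] = spectrum[order[:keep]]   # keep only the dominant peaks
    return np.fft.irfft(filtered, n=len(idle_minutes))

def is_anomaly(observed, predicted, tolerance_mins=5.0):
    """Flag a day whose real idle time differs a lot from the prediction."""
    return abs(observed - predicted) > tolerance_mins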
So I have an ongoing metric of events. They are either tagged as success or fail, so I have three numbers: failed, completed, and total. This is easily illustrated (in Datadog) using a stacked bar graph like so:
So the dark part is the failures. By looking at the y scale and the dashed red line for scale, this easily tells a human whether the rate is a problem and significant, which to me means that I have a failure rate in excess of 60%, sustained over at least some period (10 minutes?), and that there are enough events in this period to consider the rate exceptional.
So I am looking for some sort of formula that starts with failures divided by total (giving me a score between 0 and 1) and then weights this somehow by the total, with some thresholds that I decide mean the total is high enough for me to get an automated alert.
For extra credit, here is the actual Datadog metric that I am trying to get to work:
(sum:event{status:fail}.rollup(sum, 300) / sum:event{}.rollup(sum,
300))
And I am watching over a 15-minute window and alerting on a score above 0.75. But I am not sure about sum, count, avg or rollup. And of course this alert will send me mail during the night, when the total number of events drops low enough that a high failure rate isn't proof of any problem.
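As a sketch of the scoring logic only (plain Python, not Datadog query syntax; min_total is a made-up cut-off below which a high failure rate is ignored):

def should_alert(failures, total, rate_threshold=0.75, min_total=100):
    """Return True only when the failure rate is high AND there were enough events."""
    if total < min_total:          # too few events for the rate to mean anything
        return False
    rate = failures / total        # score between 0 and 1
    return rate >= rate_threshold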
I have an embedded system which generates samples (16-bit numbers) at 1 ms intervals. The variable uplink bandwidth can at best transfer a sample every 5 ms, so I am looking for ways to adaptively reduce the data rate while minimizing the loss of important information -- in this case the minimum and maximum values in a time interval.
A scheme which I think should work involves sparse coding and a variation of lossy compression. Like this:
The system will internally store the min and max values during a 10ms interval.
The system will internally queue a limited number (say 50) of these data pairs.
No loss of min or max values is allowed but the time interval in which they occur may vary.
When the queue gets full, neighboring data pairs will be combined starting at the end of the queue so that the converted min/max pairs now represent 20ms intervals.
The scheme should be iterative so that further interval combining to 40ms, 80ms etc is done when necessary.
The scheme should be linearly weighted across the length of the queue so that there is no combining for the newest data and maximum necessary combining of the oldest data.
For example with a queue of length 6, successive data reduction should cause the data pairs to cover these intervals:
initial: 10 10 10 10 10 10 (60ms, queue full)
70ms: 10 10 10 10 10 20
80ms: 10 10 10 10 20 20
90ms: 10 10 20 20 20 20
100ms: 10 10 20 20 20 40
110ms: 10 10 20 20 40 40
120ms: 10 20 20 20 40 40
130ms: 10 20 20 40 40 40
140ms: 10 20 20 40 40 80
New samples are added on the left, data is read out from the right.
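One possible reading of this combining scheme, sketched in Python for clarity (entries are (span_ms, lo, hi) tuples, newest on the left; the merge heuristic below coarsens the oldest data first, and read-out and time stamping are omitted, as in the question):

from collections import deque

QUEUE_LEN = 50
queue = deque()   # index 0 = newest (left), last index = oldest (right)

def push_pair(lo, hi, span_ms=10):
    """Add one 10 ms min/max pair; combine old entries when the queue is full."""
    queue.appendleft((span_ms, lo, hi))
    if len(queue) > QUEUE_LEN:
        merge_oldest()

def merge_oldest():
    """Merge the adjacent pair nearest the old end whose combined span is smallest."""
    items = list(queue)
    best = None
    for i in range(len(items) - 1, 0, -1):          # search from the oldest end
        span = items[i][0] + items[i - 1][0]
        if best is None or span < best[1]:
            best = (i, span)
    i, _ = best
    s1, lo1, hi1 = items[i - 1]
    s2, lo2, hi2 = items[i]
    # the merged entry keeps the true min and max, only the interval gets coarser
    items[i - 1:i + 1] = [(s1 + s2, min(lo1, lo2), max(hi1, hi2))]
    queue.clear()
    queue.extend(items)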
This idea obviously falls into the categories of lossy compression and sparse coding.
I assume this is a problem that must occur often in data-logging applications with limited uplink bandwidth, so some "standard" solution might have emerged.
I have deliberately simplified and left out other issues such as time stamping.
Questions:
Are there already algorithms which do this kind of data logging? I am not looking for the standard, lossy picture or video compression algos but something more specific to data logging as described above.
What would be the most appropriate implementation for the queue? Linked list? Tree?
The term you are looking for is "lossy compression" (See: http://en.wikipedia.org/wiki/Lossy_compression ). The optimal compression method depends on various aspects such as the distribution of your data.
As I understand it, you want to transmit the min() and max() of all samples in a time period.
E.g. you want to transmit min/max every 10 ms while taking samples every 1 ms?
If you do not need the individual samples, you can simply compare them after each sample:
/* getSample() and send() are the platform-specific functions from the question's context */
uint16_t sample, min = UINT16_MAX, max = 0;   /* first sample always overwrites the initial values */
unsigned int i = 0;

while (1) {
    sample = getSample();                     /* one 16-bit sample every 1 ms */
    if (sample < min) min = sample;
    if (sample > max) max = sample;
    if (++i % 10 == 0) {                      /* every 10th sample, i.e. every 10 ms */
        send(min, max);
        /* if each period should be handled separately: min = UINT16_MAX; max = 0; */
    }
}
You can also save bandwidth by sending data only on changes (this depends on the sample data: if the values don't change very quickly, you will save a lot).
Define a combination cost function that matches your needs, e.g. (len(i) + len(i+1)) / i^2, then iterate the array to find the "cheapest" pair to replace.
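A minimal sketch of that cost-function approach, assuming entries[i] = (span_ms, lo, hi) with i growing with age, and using i + 1 in the denominator so the newest entry does not divide by zero:

def combine_cheapest(entries):
    """Merge the adjacent pair with the lowest combination cost, in place."""
    costs = [(entries[i][0] + entries[i + 1][0]) / (i + 1) ** 2
             for i in range(len(entries) - 1)]
    i = costs.index(min(costs))                 # cheapest pair to replace
    s1, lo1, hi1 = entries[i]
    s2, lo2, hi2 = entries[i + 1]
    entries[i:i + 2] = [(s1 + s2, min(lo1, lo2), max(hi1, hi2))]
    return entries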
I have a dataset containing > 100,000 records where each record has a timestamp.
This dataset has been aggregated from several "controller" nodes which each collect their data from a set of children nodes. Each controller collects these records periodically, (e.g. once every 5 minutes or once every 10 minutes), and it is the controller that applies the timestamp to the records.
E.g:
Controller One might have 20 records timestamped at time t, 23 records timestamped at time t + 5 minutes, 33 records at time t + 10 minutes.
Controller Two might have 30 records timestamped at time (t + 2 minutes) + 10 minutes, 32 records timestamped at time (t + 2 minutes) + 20 minutes, 41 records timestamped at time (t + 2 minutes) + 30 minutes etcetera.
Assume now that the only information you have is the set of all timestamps and a count of how many records appeared at each timestamp. That is to say, you don't know i) which sets of records were produced by which controller, ii) the collection interval of each controller, or iii) the total number of controllers. Is there an algorithm which can decompose the set of all timestamps into individual subsets such that the variance of the differences between consecutive (ordered) elements of each given subset is very close to 0, while adding any element from one subset i to another subset j would increase this variance? Keep in mind that, for this dataset, a single controller's "periodicity" could fluctuate by +/- a few seconds because of CPU timing, network latency, etc.
My ultimate objective here is to establish a) how many controllers there are and b) the sampling interval of each controller. So far I've been thinking about the problem in terms of periodic functions, so perhaps there are some decomposition methods from that area that could be useful.
The other point to make is that I don't need to know which controller each record came from, I just need to know the sampling interval of each controller. So e.g. if there were two controllers that both started sampling at time u, and one sampled at 5-minute intervals and the other at 50-minute intervals, it would be hard to separate the two at the 50-minute mark because 5 is a factor of 50. This doesn't matter, so long as I can garner enough information to work out the intervals of each controller despite these occasional overlaps.
One basic approach would be to perform an FFT decomposition (or, if you're feeling fancy, a periodogram) of the dataset and look for peaks in the resulting spectrum. This will give you a crude approximation of the periods of the controllers, and may even give you an estimate of their number (and by looking at the height of the peaks, it can tell you how many records were logged).
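A hedged sketch of that approach: bin the timestamps into a count-per-bin series and run a periodogram over it (SciPy; the bin width and the number of peaks returned are arbitrary choices, and timestamps is an assumed array of epoch seconds):

import numpy as np
from scipy.signal import periodogram

def controller_periods(timestamps, bin_s=1.0, top=5):
    """Return the 'top' strongest periods (in seconds) in the record arrival counts."""
    t = np.sort(np.asarray(timestamps, dtype=float))
    edges = np.arange(t[0], t[-1] + bin_s, bin_s)
    counts, _ = np.histogram(t, bins=edges)           # records per bin
    freqs, power = periodogram(counts, fs=1.0 / bin_s)
    order = np.argsort(power[1:])[::-1] + 1           # skip the zero-frequency term
    return [1.0 / freqs[i] for i in order[:top]]

The strongest returned periods would be candidate sampling intervals; the relative peak heights give a rough idea of how many records arrive at each interval.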