I noticed a relatively recent addition to the h2o.ai suite: the ability to perform supplementary Platt Scaling to improve the calibration of output probabilities (see calibrate_model in the h2o manual). However, little guidance is available in the online help docs. In particular, I wonder, when Platt Scaling is enabled:
How does it affect the models' leaderboard? That is, is the Platt scaling applied before or after the ranking metric is computed?
How does it affect computational performance?
Can the calibration_frame be the same as the validation_frame, or should they be distinct (from both a computational and a theoretical point of view)?
Thanks in advance
Calibration is a post-processing step run after the model finishes. Therefore it doesn't affect the leaderboard, and it has no effect on the training metrics either. It adds two more columns to the scored frame (with calibrated predictions).
This article provides guidance on how to construct a calibration frame:
Split the dataset into train and test sets.
Split the train set into model-training and calibration sets.
It also says:
The most important step is to create a separate dataset to perform calibration with.
I think the calibration frame should be used only for calibration, and hence be distinct from the validation frame. The conservative answer is that they should be separate: when you use a validation frame for early stopping or any internal model tuning (e.g. lambda search in H2O GLM), that validation frame becomes an extension of the "training data", so it's effectively off-limits at that point. However, you could try both versions, directly observe the effect, and then make a decision. Here's some additional guidance from the article:
"How much data to use for calibration will depend on the amount of data you have available. The calibration model will generally only be fitting a small number of parameters (so you do not need a huge volume of data). I would aim for around 10% of your training data, but at a minimum of at least 50 examples."
Related
I was assigned a project to do anomaly detection for our company KPIs. I googled and found AnomalyDetection by Twitter. A colleague suggested doing the anomaly detection on the graph images (comparing with the previous week's images to identify anomaly points) instead of using the raw time-series data.
I am not familiar with anomaly detection. Is anyone here experienced and able to advise which is better (anomaly detection from data or from images) in terms of:
1. Accuracy
2. Storage
3. Processing
Advantages of the image-based approach:
Data-agnostic: it can theoretically be run on anything from which one can get an image/visualization.
Image models are relatively well understood.
Pretrained models are available.
Disadvantages:
Requires much more data to learn a useful model.
The image pixel space is much more complicated than the time series it represents. Probably at least 100x.
Requires much more compute power. Both at training time, and at prediction time. Probably at least 100x.
Requires much more storage for datasets. Probably at least 100x.
Sensitive to changes in visualization.
A change in tick marks or font, for example, would register as an anomaly. Even a change in image compression may have an impact, if not controlled for.
Loss of explainability. It may be hard to know why a certain image is an anomaly, even for simple cases like a mean shift.
Much more complex model setup and infrastructure needed.
For an application like anomaly detection on time-series metrics, I would not recommend doing it. I am not even sure I have seen it studied.
I think it is unlikely that a high-performing anomaly detection system for metrics can be built effectively with image processing on graphs.
Anomalies are typically quite rare, which makes this a "low data" scenario. But many anomalies are also quite simple and can be detected with simple methods; well-chosen thresholds alone can go a long way. Using image processing does not help with any of these challenges; in fact it is worse in most regards.
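To illustrate how far simple methods on the raw time series can go, here is a minimal sketch of a rolling-median/MAD threshold detector; the window size and the multiplier k are assumptions you would tune to your KPI:

```python
import numpy as np

def simple_anomalies(values, window=48, k=5.0):
    """Flag points deviating from a rolling median by more than k robust
    standard deviations (MAD-based). window and k are assumptions to tune."""
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        history = values[i - window:i]
        med = np.median(history)
        mad = np.median(np.abs(history - med)) + 1e-9   # avoid division by zero
        flags[i] = abs(values[i] - med) > k * 1.4826 * mad
    return flags
```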
The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an otherwise unspecified household over the next 24 hours, with a time resolution of a few minutes for the next three or four hours and a lower resolution for the hours after that.
The algorithm should be adaptive and learn over time. I do not have much historical data, since I want the algorithm to be usable in different settings from the start. To begin with, I only have very basic input such as the assumed yearly heat demand, the current outside temperature and the time. So it will be quite general and imprecise at the beginning, but it should learn from its errors over time.
The algorithm should be implemented in Matlab if possible.
Does anyone know an approach or an algorithm designed to predict sensible values after a short time, by learning from and adapting to the incoming data?
Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is a quite general approach, at least once its generalizations to ensemble filters etc. are taken into account (it is also easy to implement in MATLAB):
https://en.wikipedia.org/wiki/Kalman_filter
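As a minimal illustration of the idea (not the full ensemble machinery), here is a scalar Kalman filter for a random-walk state observed with noise; the process and measurement noise variances are assumptions you would tune or estimate:

```python
import numpy as np

def kalman_1d(observations, q=1e-3, r=1.0):
    """Scalar Kalman filter for a random-walk state observed with noise.
    q = process noise variance, r = measurement noise variance (both assumed)."""
    x, p = float(observations[0]), 1.0     # initial state estimate and variance
    estimates = []
    for z in observations:
        p = p + q                          # predict: uncertainty grows by q
        k = p / (p + r)                    # Kalman gain
        x = x + k * (z - x)                # update with the new measurement
        p = (1.0 - k) * p
        estimates.append(x)
    return np.array(estimates)
```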
However, more important than the actual inference algorithm is typically the design of the model you fit to your data. For your scenario you could start with a simple prediction from past values and add daily rhythms, the influence of the outside temperature, etc. The more (correct) information you put into your model a priori, the better your model should be at prediction.
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
If you go for Python, I suggest using OpenTURNS, which provides several data assimilation methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html
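To make the calibration idea concrete without assuming OpenTURNS' exact class names, here is a sketch using scipy's curve_fit on a toy heat-demand model; the functional form, the parameter names and the synthetic observations are all illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy heat-demand model: demand depends on outside temperature and hour of day.
# The functional form and the parameters (base, slope, amp) are illustrative.
def demand_model(X, base, slope, amp):
    temp, hour = X
    return base + slope * (18.0 - temp) + amp * np.sin(2 * np.pi * hour / 24.0)

# Hypothetical observations of temperature, time of day and measured demand.
temp = np.random.uniform(-5, 15, 200)
hour = np.random.uniform(0, 24, 200)
observed = demand_model((temp, hour), 2.0, 0.3, 0.5) + np.random.normal(0, 0.1, 200)

# Calibration = finding the parameter values that best reproduce the observations.
params, _ = curve_fit(demand_model, (temp, hour), observed, p0=[1.0, 0.1, 0.1])
print(params)    # calibrated (base, slope, amp)
```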
I followed this post and first made it work on the «Cats vs dogs» dataset. Then I substituted my own images, which show the presence of an object vs. the absence of that object. My dataset is even smaller than the one in the post: I only have 496 images containing the object for training and 160 such images for validation. For the «absent» class I have numerous samples (images without that object).
So far I haven't tried class_weight to tackle the imbalanced-data problem. I just randomly chose 496 and 160 images without the object for training and validation, respectively. Basically, I am doing two-class image classification with a smaller dataset using the techniques in the post, so I expected somewhat worse performance due to the insufficient data. But the actual problem is that the performance does not converge, as shown in the figures.
Could you tell me possible reasons for this non-convergence? I guess the problem is related to my dataset, as the model works perfectly for «cats vs dogs», but I don't know how to address it. Are there any good techniques to make it converge?
Thank you.
This performance plot is based on VGG16, keeping all layers up to the fully connected layer and training a small fully connected layer with 256 neurons.
This performance plot is also based on VGG16, but using 128 neurons instead of 256. I also set the number of epochs to 80.
Based on the suggestions provided so far, I'm thinking of using a customized convnet model to fight the overfitting problem. But how should I do this? One of my worries is that a model with fewer layers will degrade the training performance. Are there any guidelines for designing a good model for little data? Thank you.
Updates:
Now I think I know half of the reason for the non-convergence problem. Actually, I only have 100+ of my own images; the rest were downloaded from Flickr. I thought images with centered objects and better quality would work for the model, but later I found they did not improve the accuracy and even degraded the output class probabilities. After removing these downloaded images, the performance improved a little and the non-convergence is gone. Note that I only use 64*2 images for training and 48*2 images for testing. I also found that image augmentation could not improve the performance for my dataset: without augmentation, the training accuracy could reach 1, but with some augmentation the training accuracy is only around 85%. Has anybody had such an experience? Why doesn't data augmentation always work? Is it because of our specific dataset? Thank you very much.
Your model is working great, but it's "overfitting". It means it's capable of memorizing all your training data without really "thinking". That leads to great training results and bad test results.
Common ways to avoid overfitting are:
More data - if you have little data, the chance of overfitting increases.
Fewer units/layers - make the model less capable, so it will stop memorizing and start "thinking".
Add "dropout" to your layers (something that randomly discards part of the activations to prevent the model from being too powerful). A sketch combining the last two points follows this list.
Do more layers mean more power and performance?
If by performance you mean capability of learning, yes. (If you mean "speed", no)
Yes, more layers mean more power. But too much power leads to overfitting: the model is so capable that it can memorize training data.
So there is an optimal point:
A model that is not very capable will not give you the proper results (both training and test results will be bad)
A model that is too capable will memorize the training data (excellent training results, but bad test results)
A balanced model will learn the right things (good training and test results)
That's exactly why we use test data: it's data that is not presented during training, so the model doesn't learn from it.
My problem is the following: I need to classify a data stream coming from a sensor. I have managed to get a baseline using the median of a window, and I subtract that baseline from the values (I want to avoid negative peaks, so I only use the absolute value of the difference).
Now I need to distinguish an event (= something triggered the sensor) from the noise near the baseline:
The problem is that I don't know which method to use.
There are several approaches I have thought of:
sum up the values in a window; if the sum is above a threshold, set the class to EVENT ('integrate and dump')
sum up the differences of the values in a window and take the mean (which gives something like the first derivative); if the value is positive and above a threshold, set the class to EVENT, otherwise NO-EVENT
a combination of both
(unfortunately these approaches have the drawback that I need to guess the threshold values and choose the window size)
using an SVM that learns from manually classified data (but I don't know how to set this algorithm up properly: which features should I look at - the median/mean of a window? the integral? the first derivative? ...)
What would you suggest? Are there better/simpler methods to get this task done?
I know there are a lot of sophisticated algorithms, but I'm confused about what the best way could be - please have a little patience with a newbie who has no machine learning/DSP background :)
Thank you a lot and best regards.
The key to evaluating your heuristic is to develop a model of the behaviour of the system.
For example, what is the model of the physical process you are monitoring? Do you expect your samples, for example, to be correlated in time?
What is the model for the sensor output? Can it be modelled as, for example, a discretized linear function of the voltage? Is there a noise component? Is the magnitude of the noise known or unknown but constant?
Once you've listed your knowledge of the system that you're monitoring, you can then use that to evaluate and decide upon a good classification system. You may then also get an estimate of its accuracy, which is useful for consumers of the output of your classifier.
Edit:
Given the more detailed description, I'd suggest trying some simple models of behaviour that can be tackled using classical techniques before moving to a generic supervised learning heuristic.
For example, suppose:
The baseline, event threshold and noise magnitude are all known a priori.
The underlying process can be modelled as a Markov chain: it has two states (off and on) and the transition times between them are exponentially distributed.
You could then use a hidden Markov model (HMM) approach to determine the most likely underlying state at any given time. Even when the noise parameters and thresholds are unknown, you can use the HMM forward-backward training method to learn the parameters (e.g. the mean and variance of a Gaussian) associated with the output of each state.
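As a sketch of that idea, the third-party hmmlearn package can fit a two-state Gaussian HMM to the baseline-subtracted signal with Baum-Welch and then decode the most likely state sequence; the file name and hyperparameters below are assumptions:

```python
import numpy as np
from hmmlearn import hmm          # third-party package, assumed to be installed

signal = np.abs(np.loadtxt("sensor.txt"))    # hypothetical baseline-subtracted data
X = signal.reshape(-1, 1)                    # hmmlearn expects (n_samples, n_features)

# Two hidden states (NO-EVENT / EVENT); the Gaussian emission parameters are
# learned with the forward-backward (Baum-Welch) procedure inside fit().
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(X)
states = model.predict(X)                    # most likely state sequence (Viterbi)
```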
If you know even more about the events, you can get by with simpler approaches: for example, if you knew that the event signal always reached a level above the baseline + noise, and that events were always separated in time by an interval larger than the width of the event itself, you could just do a simple threshold test.
Edit:
The classic intro to HMMs is Rabiner's tutorial (a copy can be found here). Relevant also are these errata.
From your description, a correctly parameterized moving average might be sufficient.
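A minimal sketch of such a moving average; the window length is the parameter you would tune against recorded data:

```python
import numpy as np

def moving_average(x, window=25):
    """Centered moving average; the window length is the parameter to tune
    against your recorded sensor data."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")
```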
Try to understand the sensor and its output. Make a model and build a simulator that provides mock data covering the expected data, including noise and all that stuff.
Get lots of real sensor data recorded.
Visualize the data and verify your assumptions and model.
Annotate your sensor data, i.e. generate ground truth (your simulator should do that for the mock data).
From what you have learned so far, propose one or more algorithms.
Make a test system that can verify your algorithms against ground truth and run regressions against previous runs.
Implement your proposed algorithms and run them against ground truth.
Try to understand the false positives and false negatives in the recorded data (and try to adapt your simulator to reproduce them).
Adapt your algorithm(s).
Some other tips:
you may implement hysteresis on thresholds to avoid bouncing (see the sketch after this list)
you may implement delays to avoid bouncing
beware of delays if implementing debouncers or low pass filters
you may implement multiple algorithms and voting
for testing relative improvements you may run regression tests on large amounts of un-annotated data; then you check only the detections that flip, to find performance increases or decreases
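As referenced above, here is a minimal sketch of hysteresis on thresholds; the two threshold values are assumptions you would pick from your recorded data:

```python
def hysteresis_detector(values, on_threshold, off_threshold):
    """Two-threshold (hysteresis) event detector: the state only switches to
    EVENT above on_threshold and only back to NO-EVENT below off_threshold,
    so it does not bounce when the signal hovers near a single threshold.
    Requires off_threshold < on_threshold; both are values you must tune."""
    active = False
    labels = []
    for v in values:
        if not active and v > on_threshold:
            active = True
        elif active and v < off_threshold:
            active = False
        labels.append("EVENT" if active else "NO-EVENT")
    return labels
```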
For example, you measure data coming from some device; it could be the mass of an object moving on a bridge. Because the object is moving, the mass readings will vibrate with some amplitude that depends on the mass of the object: the bigger the mass, the bigger the vibrations.
Are there any methods for filtering such kind of noise from that data?
Maybe using some formulas for vibrations? I have no idea what kind of formulas or algorithms (filters) can be used here. Please suggest anything.
EDIT 2:
A better picture, which I just drew for better understanding:
It is not a very good picture. From that graph you can see that the frequency is the same every time, but the amplitude changes periodically. Something like that is what I have when there are no objects on the moving road (conveyor belt): the signal vibrates near the zero value.
When an object moves, there are the same waves with changing amplitude.
The graph suggests that some force may be applied to the system which produces forced oscillations, so I am interested in removing that kind of noise. I do not know what force causes these oscillations. Soon I hope to get some data for the non-moving road, with and without an object on it, for comparison with the moving-road case.
What you have in your last plot is basically an amplitude modulated oscillation coming from a function like:
f[x] := 10 * (4 + Sin[x]) * Sin[80 * x]
The constants have been chosen to match your plot (using just a rule of thumb)
A plot of this function reproduces the shape of your graph.
That isn't "noise" (although maybe some noise is there too), but it can be filtered out easily.
Let's see your data for the static and moving payloads ....
Edit
Based on your responses to several comments, and on my previous experience with weighing devices:
You are interfacing with the physical world, not just getting input from a mouse and keyboard. It is very important for you to understand the device, how it works and how it is designed.
You need a calibration procedure. You have to use several master weights to be sure that the device works properly and linearly over the whole scale, and that the static case is measured with much better accuracy than your dynamic needs require.
You will not be able to predict whether you can measure with several loads on the conveyor until you do some experiments and look very carefully at the resulting plots.
You need to be sure that a load placed anywhere on the conveyor gives the same reading, or at least you should be able to correlate reading and position.
As I said before, you need a lot of information, and it seems that it is not available. I always worked as a team with the engineers designing the device.
Don't hesitate to add more info ...
Have you tried filters with low-pass characteristics? There are different approaches to smoothing data (e.g. Savitzky-Golay, Gaussian, moving average), but often a simple N-point median filter is already sufficient.
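For illustration, both a median filter and a Savitzky-Golay filter are one-liners with scipy; the file name and window sizes below are assumptions:

```python
import numpy as np
from scipy.signal import medfilt, savgol_filter

raw = np.loadtxt("sensor_readings.txt")                 # hypothetical recorded signal
median_smoothed = medfilt(raw, kernel_size=11)          # N-point median filter; N must be odd
savgol_smoothed = savgol_filter(raw, window_length=51, polyorder=3)  # Savitzky-Golay alternative
```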
It really depends on what you're after.
Take a look at this book:
The Scientist and Engineer's Guide to Digital Signal Processing
You can download it for free. In particular, check chapters 14 and 15.
If the frequency changes with mass and you're trying to measure mass, why not measure the frequency of the oscillations and use that as your primary measure?
Otherwise you need a tunable notch filter: figure out the frequency of the "noise" and tune the notch filter to it.
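A sketch of such a tunable notch filter with scipy; the sampling rate and the vibration frequency are assumptions you would measure on your own setup:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 1000.0        # sampling rate in Hz (assumption about your acquisition setup)
f0 = 12.5          # measured frequency of the vibration "noise" in Hz (assumption)

b, a = iirnotch(w0=f0, Q=30.0, fs=fs)      # narrow band-reject filter centered at f0
raw = np.loadtxt("sensor_readings.txt")    # hypothetical recorded signal
cleaned = filtfilt(b, a, raw)              # zero-phase filtering of the recording
```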
Another book to try is Lyons' Understanding Digital Signal Processing.
In order to smooth the signal, I'd average over the previous 2*n samples, where n is the maximum expected wavelength (in samples) of the vibrations.
This should cause most of the noise to be eliminated.
If you have some idea of the range of frequencies, you could use a simple average, as long as the measurement period is sufficiently long to give you the level of accuracy you want to achieve. The more wavelengths' worth of data you average over, the smaller the error contributed by a partial wavelength.
I'd suggest first simulating/modeling this in software like Matlab.
Data you'll need to consider:
The expected range of vibration frequencies
The measurement accuracy you want to achieve
The expected range of mass you'll want to measure
The function of mass to vibration amplitude
You should be able to apply the same principle as noise-cancelling microphones: put out two sensors, then subtract the secondary sensor's signal (the one farther from the good signal source) from the primary sensor's signal (the one closer to the good signal source).
Obviously, this works best if the "noise" will reach both sensors fairly equally while the "signal" reaches the primary sensor much more strongly.
For things like sound, this is pretty easy to do in the sensor itself, which makes your software a lot easier and more performant. Depending on what you're measuring, this might be easier to do with multiple sets of hardware and doing the cancellation in software.
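A rough sketch of doing that cancellation in software: fit a single gain between the two sensors on a stretch of data known to contain only the vibration, then subtract the scaled reference from the primary signal. The noise-only slice is an assumption about your setup:

```python
import numpy as np

def cancel_noise(primary, reference, noise_only):
    """Subtract a scaled copy of the reference (noise-only) sensor from the
    primary sensor. The gain is fitted by least squares on a stretch of data
    (noise_only, a slice) known to contain vibration but no payload signal."""
    p = primary[noise_only]
    r = reference[noise_only]
    gain = np.dot(p, r) / np.dot(r, r)
    return primary - gain * reference

# Usage with hypothetical arrays: first 500 samples recorded with an empty conveyor.
# cleaned = cancel_noise(primary, reference, slice(0, 500))
```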
If you can characterize the frequency spectra of the unwanted vibration noise, you might be able to synthesize a set of (near) minimum phase notch or band reject filter(s) to allow you to acquire your desired signal at your desired S/N ratio with minimized latency or data set size.
Filtering noisy digital signals is straightforward, as previous posters have noted, and there are lots of references. However, you have not clearly stated what your objectives are, so we cannot point you in a good direction. Are you looking for a single measurement of a single object on a bridge? [Then see the other answers.]
Are you monitoring traffic on this bridge and weighing each entity as it passes by? Then you need to determine when entities are on the sensor and when they are not. Typically, as long as the sensor's noise floor is significantly lower than the signal you're measuring this can be accomplished by simple thresholding.
Are you trying to measure the vibrations of the bridge caused by other vehicles? In that case you need either a more expensive sensor (if you're having problems doing this) or a clearer measurement objective.