Vowpal Wabbit base learner - adaptive and normalize

I'm trying to understand the base learner in Vowpal Wabbit. I understand the online gradient descent and the feature hashing, and now I'm trying to understand the adaptive and normalize features it uses. I understand the point of these features (to adapt the learning rate and to even out feature scales); I was hoping to understand how they are programmed into Vowpal Wabbit. Can someone share the pseudo-code for these features or point me to them in the code base?

You won't find gd.cc particularly easy to read. It is designed for throughput. Having said that ...
Adaptive is based upon AdaGrad, which tries to adjust the effective learning rate per feature. To achieve this it accumulates a sum of squared gradients per feature: https://github.com/VowpalWabbit/vowpal_wabbit/blob/aa88627c9e9ffed6c0eea165ff85b04f0a22c0b7/vowpalwabbit/gd.cc#L502
Normalize is based upon a dimensional analysis argument. It tries to adjust the weights to match the feature scale. To achieve this it keeps a maximum absolute value seen so far for each feature: https://github.com/VowpalWabbit/vowpal_wabbit/blob/aa88627c9e9ffed6c0eea165ff85b04f0a22c0b7/vowpalwabbit/gd.cc#L507
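For intuition, here is a rough Python sketch of the two ideas. It is a sketch only, not VW's actual update: gd.cc also handles importance weights, the --power_t/--invariant scalings, interactions, etc., and combines the factors somewhat differently.

```python
# Rough sketch of the adaptive (AdaGrad-style) and normalized updates for a
# single linear model with squared loss. Illustration only: gd.cc also handles
# importance weights, --power_t/--invariant scalings, interactions, etc., and
# combines these factors slightly differently.
from collections import defaultdict
import math

class SimpleGD:
    def __init__(self, eta=0.5):
        self.eta = eta
        self.w = defaultdict(float)      # weights
        self.g2 = defaultdict(float)     # adaptive: sum of squared gradients per feature
        self.scale = defaultdict(float)  # normalize: max |x_i| seen so far per feature

    def predict(self, x):                # x: dict of feature -> value
        return sum(self.w[f] * v for f, v in x.items())

    def learn(self, x, y):
        g = self.predict(x) - y          # gradient of squared loss w.r.t. the prediction
        for f, v in x.items():
            self.scale[f] = max(self.scale[f], abs(v))   # normalize bookkeeping
            self.g2[f] += (g * v) ** 2                   # adaptive bookkeeping
            if self.g2[f] > 0 and self.scale[f] > 0:
                # shrink the step by the accumulated gradient (adaptive)
                # and by the feature's observed scale (normalize)
                step = self.eta / (math.sqrt(self.g2[f]) * self.scale[f])
                self.w[f] -= step * g * v
```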
Finally, you should try --coin, a new base learner in vw that can yield superior results with default settings.

Related

How does `vw --audit` internally compute the weights of the features?

In Vowpal Wabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
From what I understand, Vowpal Wabbit tries to fit one linear model to each arm.
So if weights were calculated using an average across all the arms, then they would correlate with getting a reward in general, instead of indicating which features make the model pick one arm over another.
I am interested in finding out how they are calculated to see how I can interpret the results obtained. I tried searching its GitHub repository but could not find anything meaningful.
I am interested in finding out how they are calculated to see how I can interpret the results obtained.
Unfortunately knowing the first does not lead to knowing the second.
Your question is concerned with contextual bandits, but it is important to note that interpreting model parameters is an issue that also occurs in supervised learning. Machine learning has made progress recently (i.e., my lifetime) largely by focusing concern on quality of predictions rather than meaningfulness of model parameters. In a blog post, Phoebe Wong outlines the issue while being entertaining.
The bottom line is that our models are not causal, so you simply cannot conclude that "because the weight of feature X for arm A is large, if I were to intervene in the system and increase this feature value then I would get more reward for playing arm A".
We are currently working on tools for model inspection that leverage techniques such as permutation importance that will help you answer questions like "if I were to stop using a particular feature how would the frequency of playing each arm change for the trained policy". We're hoping that is helpful information.
Having said all that, let me try to answer your original question ...
In Vowpal Wabbit there is an option --audit that prints the weights of the features.
If we have a vw contextual bandit model with four arms, how is this feature weight created?
The format is documented here. Assuming you are using --cb (not --cb_adf) then there are a fixed number of arms and so the offset field will increment over the arms. So for an example like
1:2:0.4 |foo bar
with --cb 4 you'll get an audit output with namespace of foo, feature of bar, and offset of 0, 1, 2, and 3.
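If you want to poke at this from Python, a minimal sketch along these lines should reproduce the audit output for the example above. It assumes the official vowpalwabbit Python bindings; on recent releases the entry point is vowpalwabbit.Workspace, while older ones expose the same thing as pyvw.vw.

```python
# Sketch: run the --cb 4 audit example above via the Python bindings.
# Assumes a recent vowpalwabbit package (entry point Workspace); older
# releases expose it as pyvw.vw. The audit lines are printed by the driver,
# one block of features per arm offset (0..3).
import vowpalwabbit

vw = vowpalwabbit.Workspace("--cb 4 --audit")
vw.learn("1:2:0.4 |foo bar")   # action 1, cost 2, probability 0.4
vw.finish()
```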
Interpreting the output when using --cb_adf is possible but difficult to explain succinctly.
From what I understand, Vowpal Wabbit tries to fit one linear model to each arm.
Shorter answer: With --cb_type dm, essentially VW independently tries to predict the average reward for each arm using only examples where the policy played that arm. So the weight you get from audit at a particular offset N is analogous to what you would get from a supervised learning model trained to predict reward on a subset of the historical data consisting solely of times the historical policy played arm N. With other --cb_type settings the interpretation is more complicated.
Longer answer: "Linear model" refers to the representation being used. VW can incorporate nonlinearities into the model, but let's ignore that for now. "Fit" is where some important details are. VW takes the partial feedback information of a CB problem (partial feedback = "for this example you don't know the reward of the arms not pulled") and reduces it to a full feedback supervised learning problem (full feedback = "for this example you do know the reward of all arms"). The --cb_type argument selects the reduction strategy. There are several papers on the topic; a good place to start is Dudík et al. and then look for papers that cite this paper. In terms of code, ultimately things are grounded here, but the code is written more for performance than intelligibility.
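For intuition only, here is a toy sketch of the direct-method (dm) idea in Python with scikit-learn. This is not VW's reduction code; the regressor choice and the function names are placeholders.

```python
# Toy illustration of the --cb_type dm idea: fit one cost regressor per arm,
# each trained only on the rounds where the logged policy played that arm.
import numpy as np
from sklearn.linear_model import SGDRegressor

def fit_dm(contexts, actions, costs, n_arms):
    """contexts: (n, d) array; actions: arm played per round; costs: observed cost."""
    models = []
    for a in range(n_arms):
        mask = actions == a                    # only rounds where arm a was played
        reg = SGDRegressor().fit(contexts[mask], costs[mask])
        models.append(reg)                     # predicts the cost of arm a from the context
    return models

def act(models, context):
    # play the arm with the lowest predicted cost
    preds = [m.predict(context.reshape(1, -1))[0] for m in models]
    return int(np.argmin(preds))
```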

Differences in FP and FN rates between two algorithms

I am conducting binary classification using logistic regression, with and without applying PCA. Applying PCA before logistic regression gives higher accuracy and fewer FNs than logistic regression alone. I would like to find out why this is happening, specifically why PCA produces fewer FNs. I have read that cost-sensitivity analysis could help explain this, but I am not sure if this is correct. Any suggestions?
There is no need for fancy analysis to explain this behavior.
PCA is used just to "clean" the data by reducing its dimensionality. Let me explain this concept with an example, and then I will come back to your question.
In general, in any ML problem, the available samples are never numerous enough to cover all the possible variety of the sample space. You can never have a dataset with all possible human faces, with all possible expressions, and so on.
So, instead of using all the available features, you engineer the features (the pixels, in this example) in a way that gives you more meaningful, higher-level features. You can reduce the resolution of the pictures, as an easy example; you will lose the information in the picture backgrounds, but your model will focus better on the most important part of the picture, i.e. the faces.
When you deal with tabular data, a technique similar to lowering the resolution is cutting off parts of the original features, and that's what PCA does: it keeps the most important components of the features, the "principal components", dropping the less important ones.
So, the model trained with PCA gives better results because, by cutting off part of the features, your model focuses better on the most important parts of your samples, and so it gains robustness against overfitting.
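If you want to see the effect yourself, a minimal scikit-learn sketch of the comparison could look like this; the dataset and the number of components are placeholders for your own setup.

```python
# Minimal sketch: logistic regression with and without PCA, comparing FP/FN
# counts from the confusion matrix. Dataset and component count are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=10),
                         LogisticRegression(max_iter=1000))

for name, model in [("plain", plain), ("with PCA", with_pca)]:
    model.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    print(f"{name}: FP={fp} FN={fn}")
```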
cheers

What type of algorithm should I use for forecasting with only very little historic data?

The problem is as follows:
I want to use a forecasting algorithm to predict the heat demand of an otherwise unspecified household over the next 24 hours, with a time resolution of a few minutes for the next three or four hours and a lower resolution for the hours after that.
The algorithm should be adaptive and learn over time. I do not have much historic data, since in the beginning I want the algorithm to be usable in different settings. I only have very basic input to begin with, such as the assumed yearly heat demand and the current outside temperature and time. So it will be quite general and imprecise at the beginning, but it will learn from its errors over time.
The algorithm should be implemented in MATLAB if possible.
Does anyone know an approach or an algorithm designed to predict sensible values after a short time by learning and adapting to the incoming data?
Well, this question is quite broad, as essentially any algorithm for forecasting or data assimilation could do this task in principle.
The classic approach I would look into first is Kalman filtering, which is a quite general approach, at least once its generalizations to ensemble filters etc. are taken into account (it is also easy to implement in MATLAB).
https://en.wikipedia.org/wiki/Kalman_filter
However, more important than the actual inference algorithm is typically the design of the model you fit to your data. For your scenario you could start with a simple prediction from past values and add daily rhythms, the influence of outside temperature, etc. The more (correct) information you put into your model a priori, the better it should be at prediction.
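As a starting point, a scalar Kalman filter that treats demand as a slowly varying level fits in a few lines. Here is a Python sketch (the question asks for MATLAB, but the translation is direct); all model and noise values below are made-up placeholders.

```python
import numpy as np

# Minimal 1-D Kalman filter: track heat demand as a slowly varying level.
# All model/noise values are placeholders; a real model would add daily
# rhythm and outside-temperature terms to the state or as inputs.
def kalman_1d(observations, q=0.1, r=1.0, x0=0.0, p0=1.0):
    x, p = x0, p0            # state estimate and its variance
    estimates = []
    for z in observations:
        p = p + q            # predict: random-walk state, process noise q
        k = p / (p + r)      # Kalman gain (r = observation noise variance)
        x = x + k * (z - x)  # update with the new measurement
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

demand = np.array([2.1, 2.3, 2.0, 2.7, 3.1, 2.9])  # fake readings
print(kalman_1d(demand))
```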
For the full mathematical analysis of this type of problem I can recommend this book: https://doi.org/10.1017/CBO9781107706804
In order to turn this into a calibration problem, we need:
a model that predicts the heat demand depending on inputs and parameters,
observations of the heat demand.
Calibrating this model means tuning the parameters so that the model best predicts the heat demand.
If you go for Python, I suggest using OpenTURNS, which provides several data assimilation methods, e.g. Kalman filtering (also called BLUE):
https://openturns.github.io/openturns/latest/user_manual/calibration.html
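If you prefer to see the calibration idea without committing to a particular library, here is a generic least-squares sketch in Python. The model form, parameter names, and data are made-up placeholders; it is not an OpenTURNS example.

```python
# Generic calibration sketch: tune the parameters of a simple heat-demand model
# so that it best matches observed demand. Model and data are placeholders.
import numpy as np
from scipy.optimize import least_squares

def model(params, t_outside, hour):
    base, temp_coeff, daily_amp = params
    # demand = baseline + outside-temperature effect + daily rhythm
    return base + temp_coeff * (18.0 - t_outside) + daily_amp * np.sin(2 * np.pi * hour / 24)

def residuals(params, t_outside, hour, observed):
    return model(params, t_outside, hour) - observed

t_outside = np.array([5.0, 3.0, 8.0, 12.0])   # placeholder observations
hour = np.array([6, 12, 18, 23])
observed = np.array([4.2, 3.9, 2.8, 2.5])

fit = least_squares(residuals, x0=[1.0, 0.1, 0.5], args=(t_outside, hour, observed))
print(fit.x)  # calibrated parameters
```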

Binary classification of sensor data

My problem is the following: I need to classify a data stream coming from a sensor. I have managed to get a baseline using the median of a window, and I subtract the values from that baseline (I want to avoid negative peaks, so I only use the absolute value of the difference).
Now I need to distinguish an event (= something triggered the sensor) from the noise near the baseline.
The problem is that I don't know which method to use.
There are several approaches of which I thought of:
sum up the values in a window, if the sum is above a threshold the class should be EVENT ('Integrate and dump')
sum up the differences of the values in a window and take the mean (which gives something like the first derivative); if the value is positive and above a threshold, set class EVENT, otherwise set class NO-EVENT
combination of both
(unfortunately these approaches have the drawback that I need to guess the threshold values and set the window size)
using an SVM that learns from manually classified data (but I don't know how to set up this algorithm properly: which features should I look at - the median/mean of a window? the integral? the first derivative? ...)
What would you suggest? Are there better/simpler methods to get this task done?
I know there exist a lot of sophisticated algorithms, but I'm confused about what could be the best way - please have a little patience with a newbie who has no machine learning/DSP background :)
Thank you a lot and best regards.
The key to evaluating your heuristic is to develop a model of the behaviour of the system.
For example, what is the model of the physical process you are monitoring? Do you expect your samples, for example, to be correlated in time?
What is the model for the sensor output? Can it be modelled as, for example, a discretized linear function of the voltage? Is there a noise component? Is the magnitude of the noise known or unknown but constant?
Once you've listed your knowledge of the system that you're monitoring, you can then use that to evaluate and decide upon a good classification system. You may then also get an estimate of its accuracy, which is useful for consumers of the output of your classifier.
Edit:
Given the more detailed description, I'd suggest trying some simple models of behaviour that can be tackled using classical techniques before moving to a generic supervised learning heuristic.
For example, suppose:
The baseline, event threshold and noise magnitude are all known a priori.
The underlying process can be modelled as a Markov chain: it has two states (off and on) and the transition times between them are exponentially distributed.
You could then use a hidden Markov model (HMM) approach to determine the most likely underlying state at any given time. Even when the noise parameters and thresholds are unknown, you can use the HMM forward-backward training method to train the parameters (e.g. mean and variance of a Gaussian) associated with the output of each state.
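As an illustration of that approach, here is a small sketch using the hmmlearn package (an assumption on my part; any HMM library with forward-backward training would do). The two hidden states stand for "no event" and "event", and the emissions are your baseline-subtracted values.

```python
# Two-state HMM over the baseline-subtracted stream: fit() runs unsupervised
# forward-backward (Baum-Welch) training, predict() gives the most likely state
# per sample. The synthetic signal below is a stand-in for your data.
import numpy as np
from hmmlearn.hmm import GaussianHMM

signal = np.abs(np.random.randn(500))           # stand-in for your processed stream
signal[200:250] += 5.0                          # injected "event"

model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(signal.reshape(-1, 1))                # learn per-state mean/variance
states = model.predict(signal.reshape(-1, 1))   # most likely state per sample

event_state = int(np.argmax(model.means_.ravel()))  # the state with the larger mean
print(np.where(states == event_state)[0][:10])       # first samples flagged as EVENT
```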
If you know even more about the events, you can get by with simpler approaches: for example, if you knew that the event signal always reached a level above the baseline + noise, and that events were always separated in time by an interval larger than the width of the event itself, you could just do a simple threshold test.
Edit:
The classic intro to HMMs is Rabiner's tutorial (a copy can be found here). Relevant also are these errata.
From your description, a correctly parameterized moving average might be sufficient.
Try to understand the sensor and its output. Make a model and build a simulator that provides mock data covering the expected data, including noise and similar effects.
Get lots of real sensor data recorded.
Visualize the data and verify your assumptions and model.
Annotate your sensor data, i.e. generate ground truth (your simulator should do that for the mock data).
From what you have learned so far, propose one or more algorithms.
Make a test system that can verify your algorithms against the ground truth and run regressions against previous runs.
Implement your proposed algorithms and run them against the ground truth.
Try to understand the false positives and false negatives on the recorded data (and try to adapt your simulator to reproduce them).
Adapt your algorithm(s).
Some other tips:
You may implement hysteresis on thresholds to avoid bouncing (see the sketch after this list).
You may implement delays to avoid bouncing.
Beware of delays if implementing debouncers or low-pass filters.
You may implement multiple algorithms and voting.
For testing relative improvements, you may run regression tests on large amounts of unannotated data; then you check only the detections that flipped to find performance increases/decreases.
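Here is a sketch of the hysteresis tip; the thresholds and data are placeholders.

```python
# Hysteresis: two thresholds so the detector does not bounce when the signal
# hovers around a single threshold. Threshold values are placeholders.
def detect_with_hysteresis(values, high=3.0, low=1.5):
    """Return EVENT/NO-EVENT labels; enter EVENT above `high`, leave below `low`."""
    in_event = False
    labels = []
    for v in values:
        if not in_event and v > high:
            in_event = True
        elif in_event and v < low:
            in_event = False
        labels.append("EVENT" if in_event else "NO-EVENT")
    return labels

print(detect_with_hysteresis([0.2, 0.4, 3.5, 2.0, 1.0, 0.3]))
```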

Remove noisy and redundant features

I have extracted features from a video sequence based on facial markers as means and standard deviations of those markers over a video sequence. They need to be classified into four different classes based on those markers.
In all I have a feature set of around 260 features. How should I determine which features are noisy and redundant in my set? I read about it in some research papers, and some of them used the plus-l take-away-r algorithm, which I found quite appropriate, but such algorithms always rate one feature against another and say it is good or bad compared to it.
How do I rate my features as good or bad? What criteria are generally used for that?
I researched a lot for a couple of days but found nothing clear-cut and useful. I would be grateful for the help, thanks.
Think of your 260 features as a basis of a 260-dimensional space. However, your basis vectors are not orthogonal to each other, so they contain a lot of redundant information. You'd like to transform these vectors into a vector set where all vectors are orthogonal to each other, thus minimizing the number of dimensions without losing (much) information.
This is what Principal component analysis does.
Linear discriminant analysis may also be of interest to you.
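A quick scikit-learn sketch of both suggestions; X and y are placeholders for your 260-dimensional feature matrix and the four class labels.

```python
# PCA (unsupervised) and LDA (supervised, at most n_classes - 1 components)
# applied to a placeholder 260-feature, 4-class dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.randn(200, 260)            # placeholder features
y = np.random.randint(0, 4, size=200)    # placeholder labels (4 classes)

X_pca = PCA(n_components=0.95).fit_transform(X)          # keep 95% of the variance
X_lda = LinearDiscriminantAnalysis(n_components=3).fit_transform(X, y)
print(X_pca.shape, X_lda.shape)
```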
You can use PCA, or you can train some classifiers and then loop over all your features, adding a big value to each feature in turn and testing whether this alteration changes the precision of the classifier; if not, you can remove that feature. After removing all the redundant features, retrain your classifiers!
It's a good idea to train not one classifier but several of them, and then make your prediction based on votes; you can use the MODE function in MATLAB to do this!
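A close relative of that perturbation idea is permutation importance, which shuffles one feature at a time instead of adding a large value and measures how much the score drops. Here is a scikit-learn sketch; the classifier choice and the data are placeholders.

```python
# Permutation importance: features whose shuffling does not hurt the score are
# candidates for removal. X, y and the classifier are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X = np.random.randn(200, 260)
y = np.random.randint(0, 4, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(clf, X_te, y_te, n_repeats=5, random_state=0)
useless = np.where(result.importances_mean <= 0)[0]   # candidates for removal
print(len(useless), "features look removable")
```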
Use the classification rate to determine how good a subset of features is. You have 260 features and therefore 2^260 subsets, which is far too many, and searching this space is very difficult. Thus it's better to remove some features first with a filter method (for example FA, t-test, Fisher, ...) and then use your search method to find the best subset of features.
The plus-l take-away-r algorithm (or another search algorithm) finds various subsets and rates them (at this stage, use the classification rate), and at the end you pick whichever subset is best.
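Here is a sketch of that filter-then-search idea with scikit-learn; the filter, the k values, and the classifier are placeholder choices (scikit-learn's sequential selector is forward/backward selection rather than plus-l take-away-r, but it plays the same role).

```python
# Filter step (univariate ANOVA F-test) to cut 260 features down, then a
# wrapper search scored by classification rate. All counts are placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

X = np.random.randn(200, 260)
y = np.random.randint(0, 4, size=200)

X_filtered = SelectKBest(f_classif, k=40).fit_transform(X, y)   # filter step

clf = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(clf, n_features_to_select=10,
                                scoring="accuracy", cv=3)        # wrapper search
X_selected = sfs.fit_transform(X_filtered, y)
print(X_selected.shape)
```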
