Concept on "best constant's loss" in vowpal wabbit's output, and the stated rule of thumb in tutorial - vowpalwabbit

I am trying to understand Vowpal Wabbit a bit better and came across this statement in the Linear Regression tutorial. (https://vowpalwabbit.org/tutorials/getting_started.html)
"At the end, some more straightforward totals are printed. The best constant and best constant's loss only work if you are using squared loss. Squared loss is the Vowpal Wabbit default. They compute the best constant’s predictor and the loss of the best constant predictor.
If average loss is not better than best constant's loss, something is wrong. In this case, we have too few examples to generalize."
Based on that context, I have 2 related questions:
Is the best constant's loss based on the loss of the null model in linear regression?
Is the general rule of thumb for "average loss" not being better than "best constant's loss" applicable to all loss functions (since the statement does state that the "best constant" only works for the default squared loss function)?
Thanks in advance for any responses!

Is the best constant's loss based on the loss of the null model in linear regression?
If by null-model you mean the model which always predicts the best-constant, then yes.
Is the general rule of thumb for "average loss" not being better than "best constant's loss" applicable to all loss functions?
Yes. If always using the same prediction (the best constant applicable to a given loss function) does better than the learned model, it means the learned model is inferior to the simplest possible model. The simplest model for a given loss function always predicts the same (best constant) result, ignoring the input features in the data.
One of the most common causes of a learned model being inferior to the best-constant model is a data set that is too small. When the data set is small, the learning process hasn't had a chance to fully converge yet. This is also known as under-fitting.
How is the best constant calculated (for completeness)?
In the case of linear regression (a least-squares hyperplane; vw --loss_function squared, which is the default), the best constant is the simple average (mean) of the labels. This minimizes the squared loss.
In the case of quantile loss (aka absolute error; vw --loss_function quantile), the best constant is the median of the labels, and it minimizes the sum of absolute distances between the labels and the prediction.
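To make that concrete, here is a minimal sketch (my own illustration in Python, not actual VW output) of how the best constant and its loss work out for the two loss functions above:

```python
import numpy as np

# Minimal sketch (my own illustration, not actual VW output):
# the best constant under squared loss is the mean of the labels;
# under quantile loss (tau = 0.5, i.e. absolute error) it is the median.
labels = np.array([1.0, 2.0, 2.5, 4.0, 10.0])

best_const_sq = labels.mean()                                  # 3.9
best_const_loss_sq = np.mean((labels - best_const_sq) ** 2)    # 10.24

best_const_q = np.median(labels)                               # 2.5
best_const_loss_q = np.mean(np.abs(labels - best_const_q))     # 2.2

print(best_const_sq, best_const_loss_sq)
print(best_const_q, best_const_loss_q)
```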

Related

Adjust predicted probability after smote

I have an imbalance data set and I used smote to oversample the minority class and undersample the majority class.
Now I want to check the test AUC using the model's predict_proba.
I have two questions:
1. Do I have to correct the probability if I am comparing AUCs?
2. How can I correct it (for a combination of undersampling and oversampling)?
(1) The good news is no, you don't have to correct when comparing AUC. The resampling correction is a strictly increasing function of the uncorrected score, so it doesn't change the order of cases, so the ROC is exactly the same.
(2) There is a simple formula for correcting after under/over-sampling, I forget what it is, I'm pretty sure a web search will find it.
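For what it's worth, one commonly used prior-correction looks like the sketch below (an assumption on my part; it may or may not be the exact formula meant above): rescale the predicted odds by the ratio of the true class prior to the class prior in the resampled training set.

```python
import numpy as np

# Hedged sketch of one common prior-correction (an assumption, not necessarily
# the exact formula the answer above has in mind).
def correct_probs(p_resampled, pi_train, pi_true):
    """Map probabilities learned on resampled data back to the original prior.

    pi_train: positive-class rate in the resampled training set
    pi_true:  positive-class rate in the original data
    """
    odds = p_resampled / (1.0 - p_resampled)
    # rescale the odds by (true prior odds) / (training prior odds)
    odds_corrected = odds * (pi_true / (1 - pi_true)) / (pi_train / (1 - pi_train))
    return odds_corrected / (1.0 + odds_corrected)

p = np.array([0.2, 0.5, 0.9])
print(correct_probs(p, pi_train=0.5, pi_true=0.05))
```

Note that this mapping is strictly increasing in the uncorrected score, which is consistent with point (1): the AUC is unchanged.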
Further discussion is best suited to stats.stackexchange.com.

Measure expected time to execute any function

Often in machine learning, training consumes a lot of time; this is measurable, but only after training has ended.
Is there some method which can be used to estimate the time it might take to complete the training (or, generally, any function), something like a before_call?
Sure, it depends on the machine and even more on the inputs, but perhaps an approximation based on all the I/O the algorithm will perform, measured on simple inputs and then scaled to the size of the actual inputs? Something like that?
PS - JS, Ruby or any other OO language
PPS - I see that in Oracle there is a way, described here. That is cool. How is it done?
Let Ci be the complexity of the i'th learning step. Let Pi be the probability that the thing to be learned will be learned at or before the i'th step. Let k be the step where Pk > 0.5.
In this case the complexity, C is
C = sum(Pi, i=1,k)
The problem is that k is difficult to find. In this case it is a good idea to have a stored set of previously learned, similar patterns and compute the step number by which half of them had been learned, i.e. the median. If the set is large enough, this will be pretty accurate.
Pi = (number of instances in which the pattern was learned by step i) / (total number of instances)
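A small sketch of that procedure (my own illustration; it assumes you keep a log of the step at which each previous, similar run finished -- the numbers below are made up):

```python
# Sketch of the procedure above (hypothetical helper; `past_steps` holds the
# step at which each previous, similar training run finished).
def estimate_steps(past_steps):
    n = len(past_steps)
    max_step = max(past_steps)
    # P_i = fraction of past runs that had finished learning by step i
    P = [sum(1 for s in past_steps if s <= i) / n for i in range(1, max_step + 1)]
    # k = first step where more than half of the past runs had finished (the median)
    k = next(i for i, p in enumerate(P, start=1) if p > 0.5)
    return k, sum(P[:k])

k, C = estimate_steps([120, 95, 300, 150, 180])
print(f"median step k = {k}, C = {C:.1f}")
```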
Unless you set a time limit or a limit on the number of steps (in which case the estimate is trivial), there is no way to estimate the required time in general.
For example, neural network training is basically a problem of global high-dimensional optimization: you are trying to find the set of parameters for a given loss function that yields minimal error. This task belongs to the class of NP-hard problems and is very difficult to solve. A common approach is to randomly change some parameters by a small amount in the hope that it improves overall performance. It works great in practice, but the required runtime can vary greatly from problem to problem. I would recommend reading about NP-completeness, stochastic gradient descent and optimization in general.
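To make the "no limits, no bound" point concrete, here is a toy random-perturbation search (my own illustration, not tied to any library); without a step limit, the number of iterations is unknown up front:

```python
import numpy as np

# Toy sketch (my own illustration): a random-perturbation search whose runtime
# has no a-priori bound, because it stops only when the loss happens to drop
# below a threshold.
rng = np.random.default_rng(0)
target = np.ones(5)
params = rng.normal(size=5)
loss = lambda p: float(np.sum((p - target) ** 2))

current, steps = loss(params), 0
while current > 0.1:                                      # no step limit
    candidate = params + rng.normal(scale=0.05, size=5)   # small random change
    if loss(candidate) < current:                         # keep it only if it helps
        params, current = candidate, loss(candidate)
    steps += 1
print(steps, current)
```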

Why should we compute the image mean when we train CNNs?

When I use caffe for image classification, it often computes the image mean. Why is that the case?
Someone said that it can improve the accuracy, but I don't understand why this should be the case.
Refer to the image whitening technique in deep learning. It has been shown to improve accuracy, although it is not widely used.
To understand why it helps, consider the idea of normalizing data before applying any machine learning method, which helps keep the features in the same range. Another method now commonly used in CNNs is batch normalization.
Neural networks (including CNNs) are models with thousands of parameters which we try to optimize with gradient descent. Those models are able to fit a lot of different functions by having a non-linearity φ at their nodes. Without a non-linear activation function, the network collapses to a linear function in total. This means we need the non-linearity for most interesting problems.
Common choices for φ are the logistic function, tanh or ReLU. All of them have their most interesting region around 0: this is where the gradient is either big enough to learn quickly or, in the case of ReLU, where there is any non-linearity at all. Weight initialization schemes like Glorot initialization try to make the network start at a good point for the optimization. Other techniques like Batch Normalization also keep the mean of the nodes' input around 0.
So you compute (and subtract) the mean of the image so that the first computing nodes get data which "behaves well". It has a mean of 0 and thus the intuition is that this helps the optimization process.
In theory, a network should be able to "subtract" the mean by itself, so if you train long enough this should not matter too much. However, depending on the activation function, "long enough" can be quite long.
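For completeness, a minimal sketch of the mean subtraction itself (toy NumPy data, not Caffe's implementation):

```python
import numpy as np

# Minimal sketch of the mean subtraction (toy NumPy data, not Caffe):
# subtract the per-channel mean so the first layer sees inputs centred around 0.
images = np.random.rand(32, 224, 224, 3).astype(np.float32)   # N x H x W x C
channel_mean = images.mean(axis=(0, 1, 2))                    # one mean per channel
centered = images - channel_mean
print(channel_mean, centered.mean(axis=(0, 1, 2)))            # second vector is ~0
```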

Continuous vs Discrete artificial neural networks

I realize that this is probably a very niche question, but has anyone had experience with working with continuous neural networks? I'm specifically interested in what a continuous neural network may be useful for vs what you normally use discrete neural networks for.
For clarity I will clear up what I mean by a continuous neural network, as I suppose it can be interpreted to mean different things. I do not mean that the activation function is continuous. Rather, I am alluding to the idea of increasing the number of neurons in the hidden layer to an infinite amount.
So for clarity, here is the architecture of your typical discrete NN:
[figure: architecture of a typical discrete NN] (source: garamatt at sites.google.com)
The x are the inputs, g is the activation function of the hidden layer, v are the weights of the hidden layer, w are the weights of the output layer, b is the bias, and apparently the output layer has a linear activation (namely none).
The difference between a discrete NN and a continuous NN is depicted by this figure:
[figure: discrete vs. continuous NN] (source: garamatt at sites.google.com)
That is, you let the number of hidden neurons become infinite so that your final output is an integral. In practice this means that instead of computing a deterministic finite sum, you must approximate the corresponding integral with quadrature.
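To make the sum-vs-integral distinction concrete, here is a toy sketch (my own illustration, not taken from the linked slides): a scalar input, tanh hidden activations, and the continuous case approximated by trapezoidal quadrature.

```python
import numpy as np

# Toy sketch (my own illustration, not taken from the linked slides):
# a scalar input x, tanh hidden activations.
def discrete_nn(x, v, w, b):
    # ordinary finite sum over hidden units
    return float(np.sum(w * np.tanh(v * x + b)))

def continuous_nn(x, w_fn, n_points=200):
    # hidden units indexed by a continuous parameter u; the output is an
    # integral over u, approximated here by trapezoidal quadrature
    u = np.linspace(-1.0, 1.0, n_points)
    return float(np.trapz(w_fn(u) * np.tanh(u * x), u))

print(discrete_nn(0.5, v=np.array([1.0, -2.0]), w=np.array([0.3, 0.7]), b=np.zeros(2)))
print(continuous_nn(0.5, w_fn=lambda u: np.exp(-(u - 0.3) ** 2)))
```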
Apparently it's a common misconception with neural networks that too many hidden neurons produces over-fitting.
My question, specifically: given this definition of discrete and continuous neural networks, has anyone had experience working with the latter, and what sort of things did they use them for?
Further description on the topic can be found here:
http://www.iro.umontreal.ca/~lisa/seminaires/18-04-2006.pdf
I think this is either only of interest to theoreticians trying to prove that no function is beyond the approximation power of the NN architecture, or it may be a proposition on a method of constructing a piecewise linear approximation (via backpropagation) of a function. If it's the latter, I think there are existing methods that are much faster, less susceptible to local minima, and less prone to overfitting than backpropagation.
My understanding of NN is that the connections and neurons contain a compressed representation of the data it's trained on. The key is that you have a large dataset that requires more memory than the "general lesson" that is salient throughout each example. The NN is supposedly the economical container that will distill this general lesson from that huge corpus.
If your NN has enough hidden units to densely sample the original function, this is equivalent to saying your NN is large enough to memorize the training corpus (as opposed to generalizing from it). Think of the training corpus as also a sample of the original function at a given resolution. If the NN has enough neurons to sample the function at an even higher resolution than your training corpus, then there is simply no pressure for the system to generalize because it's not constrained by the number of neurons to do so.
Since no generalization is induced nor required, you might as well just memorize the corpus by storing all of your training data in memory and use k-nearest neighbor, which will always perform better than any NN, and will always perform as well as any NN even as the NN's sampling resolution approaches infinity.
The term hasn't quite caught on in the machine learning literature, which explains all the confusion. It seems like this was a one-off paper, an interesting one at that, but it hasn't really led to anything, which may mean several things; the author may have simply lost interest.
I know that Bayesian neural networks (with countably many hidden units; the "continuous neural networks" paper extends this to the uncountable case) were successfully employed by Radford Neal (see his thesis, which is all about this stuff) to win the NIPS 2003 Feature Selection Challenge.
In the past I've worked on a few research projects using continuous NNs. Activation was done using a bipolar hyperbolic tangent; the network took several hundred floating-point inputs and output around one hundred floating-point values.
In this particular case the aim of the network was to learn the dynamic equations of a mineral train. The network was given the current state of the train and predicted speed, inter-wagon dynamics and other train behaviour 50 seconds into the future.
The rationale for this particular project was mainly about performance. It was being targeted at an embedded device, and evaluating the NN was much more performance-friendly than solving a traditional ODE (ordinary differential equation) system.
In general a continuous NN should be able to learn any kind of function. This is particularly useful when it's impossible or extremely difficult to solve a system using deterministic methods, as opposed to binary networks, which are often used for pattern recognition/classification purposes.
Given their non-deterministic nature, NNs of any kind are touchy beasts; choosing the right kinds of inputs and network architecture can be somewhat of a black art.
Feed forward neural networks are always "continuous" -- it's the only way that backpropagation learning actually works (you can't backpropagate through a discrete/step function because it's non-differentiable at the bias threshold).
You might have a discrete (e.g. "one-hot") encoding of the input or target output, but all of the computation is continuous-valued. The output may be constrained (i.e. with a softmax output layer such that the outputs always sum to one, as is common in a classification setting) but again, still continuous.
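For instance (a toy sketch, not tied to any particular framework): a one-hot "discrete" input still flows through an entirely continuous-valued forward pass, and a softmax only constrains the outputs to sum to one.

```python
import numpy as np

# Toy sketch (not from any particular framework): a "discrete" one-hot input,
# but every computation in the forward pass is continuous-valued, and a softmax
# only constrains the outputs to sum to one.
rng = np.random.default_rng(0)
x = np.eye(5)[2]                                  # one-hot encoding of category 2
W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 3))
h = np.tanh(x @ W1)                               # continuous hidden layer
logits = h @ W2
probs = np.exp(logits) / np.exp(logits).sum()     # softmax: continuous, sums to 1
print(probs, probs.sum())
```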
If you mean a network that predicts a continuous, unconstrained target -- think of any prediction problem where the "correct answer" isn't discrete, and a linear regression model won't suffice. Recurrent neural networks have at various times been a fashionable method for various financial prediction applications, for example.
Continuous neural networks are not known to be universal approximators (in the sense of density in $L^p$ or $C(\mathbb{R})$ for the topology of uniform convergence on compacts, i.e.: as in the universal approximation theorem) but only universal interpolators in the sense of this paper:
https://arxiv.org/abs/1908.07838

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm?

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm? Why or why not?
Basically, I'm trying to figure out why the Wikipedia page for Statistical Classification does not mention LSI. I'm just getting into this stuff and I'm trying to see how all the different approaches for classifying something relate to one another.
No, they're not quite the same. Statistical classification is intended to separate items into categories as cleanly as possible -- to make a clean decision about whether item X is more like the items in group A or group B, for example.
LSI is intended to show the degree to which items are similar or different and, primarily, to find items that show a degree of similarity to a specified item. While this is similar, it's not quite the same.
LSI/LSA is essentially a technique for dimensionality reduction, and it is usually coupled with a nearest neighbor algorithm to turn it into a classification system. Hence, in itself, it is only a way of "indexing" the data in a lower dimension using SVD.
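As a sketch of that point (toy data, my own illustration): a truncated SVD of the term-document matrix gives the lower-dimensional "index"; classification only appears when you add a nearest-neighbor lookup in that space.

```python
import numpy as np

# Toy sketch (my own illustration): LSI/LSA as a truncated SVD of the
# term-document matrix, turned into a classifier only by a nearest-neighbor step.
rng = np.random.default_rng(0)
term_doc = rng.random((500, 40))                  # 500 terms x 40 documents
labels = rng.integers(0, 2, size=40)              # toy document labels

U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
k = 10
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T            # documents in k-dim latent space

def classify(term_counts):
    # fold the query into the latent space, then take the nearest document's label
    q = np.linalg.inv(np.diag(s[:k])) @ U[:, :k].T @ term_counts
    return labels[np.argmin(np.linalg.norm(doc_vecs - q, axis=1))]

print(classify(term_doc[:, 0]))                   # should recover labels[0]
```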
Have you read about LSI on Wikipedia? It says it uses matrix factorization (SVD), which in turn is sometimes used in classification.
The primary distinction in machine learning is between "supervised" and "unsupervised" modeling.
Usually the words "statistical classification" refer to supervised models, but not always.
With supervised methods the training set contains a "ground-truth" label that you build a model to predict. When you evaluate the model, the goal is to predict the best guess at (or probability distribution of) the true label, which you will not have at time of evaluation. Often there's a performance metric and it's quite clear what the right vs wrong answer is.
Unsupervised classification methods attempt to cluster a large number of data points which may appear to vary in complicated ways into a smaller number of "similar" categories. Data in each category ought to be similar in some kind of 'interesting' or 'deep' way. Since there is no "ground truth" you can't evaluate 'right or wrong', but 'more' vs 'less' interesting or useful.
Similarly, at evaluation time you can place new examples into one of the clusters (crisp classification) or give some kind of weighting quantifying how similar or different each example looks from the "archetype" of the cluster.
So in some ways both supervised and unsupervised models can yield something which is a "prediction" (a prediction of a class or cluster label), but they are intrinsically different.
Often the goal of an unsupervised model is to provide more intelligent and powerfully compact inputs for a subsequent supervised model.
