I wanted to do lasso regression with Vowpal Wabbit, so I used this command line:
vw --save_resume --readable_model ob/e/nsefut/VW_testing/BuyModel.VWM -d ob/e/nsefut/VW_testing/VWsell.VWF --quiet --predictions ob/e/nsefut/VW_testing/predict.VW --loss_function logistic --noconstant --l1 0.001
The readable model file shows no weights for the features I used, but when I skip the --l1 parameter it shows the weights properly. Also, when I don't pass --l1, it comes up with weights like this:
1:-0.437898 994842.000000 1.000000
33340:-0.176359 201942.265625 1.006310
59044:-0.152967 201843.875000 1.002754
63438:-0.187405 202149.140625 1.015530
124204:-0.159398 201741.187500 1.002742
166130:-0.185312 201754.421875 1.013330
This suggests that all the weights are negative. But all my features have positive values, so the linear combination of features would be negative for every observation, making every prediction negative. Yet I am seeing both positive and negative predicted labels.
Three questions:
1) Is my command line correct for lasso regression?
2) What will enable me to see the weights?
3) What is it that I am not understanding about the all-negative weights?
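For reference, vanishing weights are the expected behavior of an L1 penalty once it is strong enough: it drives coefficients exactly to zero. A minimal sketch of that effect using scikit-learn's Lasso (not VW; the data and coefficients below are made up purely for illustration):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([0.5, -0.3, 0.0, 0.2, 0.0]) + 0.01 * rng.normal(size=100)

for alpha in (0.001, 0.1, 10.0):
    # larger alpha -> more coefficients shrunk to exactly 0
    print(alpha, Lasso(alpha=alpha).fit(X, y).coef_)

VW applies its --l1 penalty online at every update, so a value like 0.001 is much more aggressive than it looks; trying far smaller values (e.g. 1e-6) is a common first step.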
Related
I am trying to find outliers in the residuals. I used three algorithms; basically, if the residual magnitudes are small the algorithms perform well, but if the residual magnitudes are big they do not.
1) X^2 = (y - h(x))^T S^(-1) (y - h(x)) - Chi-Square Test
If the matrix is 3x3, the degrees of freedom are 4, and the test flags an outlier when
X^2 > 13.277
2) Residual(i) > 3√(H P H^T + R) - Measurement Noise Covariance
3) Residual(i) > 3-Sigma
I have applied three algorithms to find the outliers: the first is the Chi-Square test, the second checks against the measurement noise covariance, and the third looks at 3-sigma.
Can you give any suggestions about these algorithms, or suggest a new approach I could implement?
The third case cannot be correct in all cases, because it will fail whenever there is a large residual. The second one is more stable, because it is tied to the measurement noise covariance, so the threshold on the residual changes according to the measurement covariance error.
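A hedged sketch of all three checks with numpy/scipy; every name here (r, S, H, P, R) is an assumption standing in for the question's quantities, and the data is synthetic:

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

k = 3                              # residual dimension
S = np.eye(k)                      # innovation covariance, S = H P H^T + R
r = rng.normal(size=k)             # one residual sample, r = y - h(x)

# 1) Chi-Square test: X^2 = r^T S^-1 r against a critical value
x2 = r @ np.linalg.solve(S, r)
print(x2 > chi2.ppf(0.99, df=4))   # 13.277, as in the question

# 2) Covariance gate: |r_i| > 3 sqrt(diag(H P H^T + R)), per component
print(np.abs(r) > 3.0 * np.sqrt(np.diag(S)))

# 3) Plain 3-sigma on the empirical residual spread (needs a batch of residuals)
batch = rng.normal(size=(1000, k))
print(np.abs(r) > 3.0 * batch.std(axis=0))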
I would like to run a linear regression on Vowpal Wabbit using the null model (intercept only, for comparison purposes). Which optimizer should I use for this? Also, is the best constant loss reported that of the simple average?
A1: For linear regression, if you care about averages, you should use --loss_function squared (which is the default). If you care more about the median than the average (e.g. if you have some outliers that may greatly mess up the average), use --loss_function quantile. BTW: these are not optimizers, just loss functions. I would leave the optimizer (enhanced SGD) as is (the default), since it works very well.
A2: The best constant is the constant prediction that would give the lowest error, and the best constant loss is the average error for always predicting that best constant number. It is the weighted average of all your target variables. This is not the same as the intercept B in the linear-regression formula y = Ai*xi + B. B is the free term, independent of the inputs, and it is not necessarily the average of the ys.
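A tiny numeric illustration of A2 in plain Python (the numbers are made up):

import numpy as np

y = np.array([1.0, 2.0, 3.0, 10.0])   # toy targets
w = np.ones_like(y)                   # example weights (all 1 here)

best_constant = np.average(y, weights=w)                              # 4.0
best_constant_loss = np.average((y - best_constant) ** 2, weights=w)  # 12.5
print(best_constant, best_constant_loss)

This matches squared loss; under quantile loss the best constant would be the (weighted) median instead.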
A3: If you want to find the intercept of your model, look for the weight named Constant in your model. This would require two short steps:
# 1) Train your model from the dataset
# and save the model in human-readable (aka "inverted hash") format
vw --invert_hash model.ih your_dataset
# 2) Search for the free/intercept term in the readable model
grep '^Constant:' model.ih
The output of the grep step should be something like:
Constant:116060:-1.085126
Where 116060 is the hash slot (location in the model) and -1.085126 is the value of the intercept (assuming no hash collisions, and a model that is a linear combination of the inputs).
I'm implementing AdaBoost (boosting) that will use CART and C4.5. I read about AdaBoost, but I can't find a good explanation of how to combine AdaBoost with decision trees. Let's say I have a data set D with n examples, which I split into TR training examples and TE testing examples.
Let's say TR.count = m, so I set the weights to 1/m each. Then I use TR to build a tree, test it with TR to get the wrongly classified examples, and test with TE to calculate the error. Then I change the weights. Now how do I get the next training set? What kind of sampling should I use (with or without replacement)? I know that the new training set should focus more on the samples that were wrongly classified, but how can I achieve this? How will CART or C4.5 know that they should focus on the examples with greater weight?
As far as I know, the TE data set is not meant to be used to estimate the error rate; the raw data can be split into two parts (one for training, the other for cross-validation). Mainly, we have two methods to apply weights to the training data distribution, and which one to use is determined by the weak learner you choose.
How to apply the weights?
Re-sample the training data set with replacement, drawing each instance with probability proportional to its weight (this is usually called boosting by re-sampling). The re-sampled data sets contain the misclassified instances with higher probability than the correctly classified ones, which forces the weak learning algorithm to concentrate on the misclassified data.
Directly use the weights when learning (boosting by re-weighting). Models that support this include Bayesian classification and decision trees (C4.5 and CART). With respect to C4.5, we calculate the information gain to determine which predictor is selected as the next node, so we can fold the weights into the entropy estimate by treating them as the probability mass of each sample. Given X = [1,2,3,3] with weights [3/8, 1/16, 3/16, 6/16]: normally, the entropy of X is -0.25 log(0.25) - 0.25 log(0.25) - 0.5 log(0.5), but with the weights taken into account, the weighted entropy is -(3/8) log(3/8) - (1/16) log(1/16) - (9/16) log(9/16). In general, C4.5 can be implemented with weighted entropy, where the unweighted case corresponds to weights [1,1,...,1]/N. Both options are sketched in code below.
If you want to implement AdaBoost.M1 with C4.5, you should read page 339 of The Elements of Statistical Learning.
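A rough numpy sketch of both options; the function names are mine, and the entropy numbers reproduce the X = [1,2,3,3] example above:

import numpy as np

rng = np.random.default_rng(0)

# Option 1: boosting by re-sampling -- draw a new training set with
# replacement, picking each instance with probability equal to its weight.
def resample(X, y, weights):
    idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
    return X[idx], y[idx]

# Option 2: boosting by re-weighting -- treat the weights as the
# probability mass of each value when computing entropy / information gain.
def weighted_entropy(values, weights):
    p = np.array([weights[values == v].sum() for v in np.unique(values)])
    p = p / p.sum()
    return -(p * np.log2(p)).sum()

X = np.array([1, 2, 3, 3])
w = np.array([3/8, 1/16, 3/16, 6/16])
print(weighted_entropy(X, np.full(4, 1/4)))  # unweighted: 1.5 bits
print(weighted_entropy(X, w))                # weighted entropy from the example
print(resample(X, np.array([0, 1, 0, 1]), w)) # dummy labels, just to show usage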
I have given Vowpal Wabbit a dataset with two labels and performed logistic regression on it. The problem is that it returns real numbers varying from positive to negative as predictions. Now I want to transform these values into probabilities of some sort. How should I go about it?
I was thinking the predicted value might be a'x, where a is the coefficient vector and x is the feature vector. If this is the case, then I can directly use the binomial link function to get the probabilities.
Use --link=logistic on the command line.
Alternatively, you may use the logistic script in vw's utl folder to convert already-obtained results.
Please refer to: How to return predictions in the [0, 1] interval for SVMs in vowpal wabbit
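For reference, the conversion is just the standard logistic sigmoid applied to the raw score a'x; a minimal sketch (the sample scores are made up):

import numpy as np

def to_probability(raw_scores):
    # map raw logistic scores a'x to probabilities in (0, 1)
    return 1.0 / (1.0 + np.exp(-np.asarray(raw_scores)))

print(to_probability([-2.3, 0.0, 1.7]))  # -> approx [0.091, 0.5, 0.846]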
I am trying to use an SVM to train some image models. However, an SVM is not a probabilistic framework, so it outputs the distance to the separating hyperplane as a raw number.
Platt converted the output of the SVM to a likelihood using an optimization procedure, but I fail to understand it: does the method assume each class has the same probability? I.e., for a binary classifier, if the training sets are even and proportional, does label 1 or -1 occur with 50% probability each time?
Secondly, in some papers I read that for a binary SVM classifier they convert the -1 and 1 labels to the range 0 to 1 and compute the likelihood, but they do not mention anything about how to convert the SVM distance to a probability.
Sorry for my English. I would welcome any suggestions and comments. Thank you.
link to paper
Well, as far as I can tell, that paper is proposing a mapping from the SVM output to the range [0,1] using a sigmoid function.
From a simplified point of view, it is something like Sigmoid(RAWSVM(X)) in [0,1], so there is no explicit "weight" on the labels. The idea is that you take one label (let's say Y = +1), take the output of the SVM, and see how close the prediction for that pattern is to that label: if it is close, the sigmoid will give you a number close to 1, otherwise it will give you a number close to 0. Hence you have a sense of probability.
Secondly, in some papers I read that for a binary SVM classifier they convert the -1 and 1 labels to the range 0 to 1 and compute the likelihood, but they do not mention anything about how to convert the SVM distance to a probability.
Yes, you are correct: some implementations work in the realm of [0,1] instead of [-1,+1], and some even map the label to a factor depending on the value of C. In any case, that shouldn't affect the method proposed in the paper, since it would map any range to [0,1]. Keep in mind that this "probabilistic" distribution is just a map from any range to [0,1] assuming uniformity. I am oversimplifying, but the effect is the same.
One last thing: the sigmoid map is not static but data-driven, which means there is some training on the dataset to parametrize the sigmoid and adjust it to the data. In other words, for two different datasets you would probably get two different mapping functions.
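A hedged sketch of that data-driven fit using scikit-learn: fitting a logistic regression on held-out SVM decision values approximates Platt's sigmoid P(y=1|f) = 1/(1 + exp(A*f + B)). All names and data here are mine, not from the paper:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, random_state=0)

svm = SVC(kernel="linear").fit(X_tr, y_tr)  # raw SVM, no probabilities
f = svm.decision_function(X_cal)            # signed distances to the hyperplane

# fit the sigmoid on held-out scores so it is adjusted to the data
platt = LogisticRegression().fit(f.reshape(-1, 1), y_cal)
print(platt.predict_proba(f[:5].reshape(-1, 1))[:, 1])  # calibrated probabilities

(scikit-learn's SVC(probability=True) does essentially this internally, using Platt's original fitting procedure.)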