vowpal wabbit: --multilabel_oaa does not produce probabilities - vowpalwabbit

I've tried
vw --multilabel_oaa 68 -d vw_data.csv --loss_function=logistic --probabilities -p probabilities.txt
and ended up with target labels only in probabilities.txt. The -r option, which is designed to produce raw output, unfortunately returned nothing either.
Apart from that, I'm not sure whether there is a way to achieve similar behaviour (multilabel prediction with logistic loss) with the other available VW multiclass options such as --csoaa and --wap.

I don't remember exactly, but I think --probabilities does not support multilabel. I don't even know what the interpretation would be (modelling the probability of label co-occurrence? and providing the probabilities for all 2^68 subsets?).
You can use the standard multi-class --oaa 68. With --probabilities it should predict the probability of each class, so you can use e.g. some kind of threshold for selecting multiple labels (= classes) for each example (e.g. such that the sum of their probabilities is at least 42%).
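The thresholding idea above can be sketched as a small post-processing step over VW's per-class probability output. This is a minimal, hypothetical sketch: the class ids, probabilities, and the 42% cutoff are illustrative, not anything VW itself produces or mandates.

```python
def select_labels(probs, threshold=0.42):
    """Greedily keep the most likely classes until their cumulative
    probability reaches the threshold.

    probs: dict mapping class id -> predicted probability
    """
    chosen = []
    total = 0.0
    # Take classes in decreasing order of predicted probability.
    for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        chosen.append(label)
        total += p
        if total >= threshold:
            break
    return chosen

# Example: classes 1 and 2 together cover >= 42% of the mass.
print(select_labels({1: 0.30, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.10}))  # -> [1, 2]
```

A fixed per-class cutoff (e.g. keep every class with p >= 0.1) is an equally valid alternative; which rule works better depends on how many labels your examples typically carry.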

Related

Why does accessing coefficients following estimation with nl require slightly different syntax than for other estimation commands?

Following most estimation commands in Stata (e.g. reg, logit, probit, etc.) one may access the estimates using the _b[ParameterName] syntax (or the synonymous _coef[ParameterName]). For example:
regress y x
followed by
di _b[x]
will display the estimate of the coefficient of x. di _b[_cons] will display the coefficient of the estimated intercept (assuming the regress command was successful), etc.
But if I use the nonlinear least squares command nl I (seemingly) have to do something slightly different. Now (leaving aside that for this example model there is absolutely no need to use a NLLS regression):
nl (y = {_cons} + {x}*x)
followed by (notice the forward slash)
di _b[/x]
will display the estimate of the coefficient of x.
Why does accessing parameter estimates following nl require a different syntax? Are there subtleties to be aware of?
"leaving aside that for this example model there is absolutely no need to use a NLLS regression": I think that's what you can't do here....
The question is about why the syntax is as it is. That's a matter of logic and a matter of history. Why a particular syntax was chosen is ultimately a question for the programmers at StataCorp who chose it. Here is one limited take on your question.
The main syntax for regression-type models grows out of a syntax designed for linear regression models in which by default the parameters include an intercept, as you know.
The original syntax for nonlinear regression models (in the sense of being estimated by nonlinear least-squares) matches a need to estimate a bundle of parameters specified by the user, which need not include an intercept at all.
Otherwise put, there is no question of an intercept being a natural default; no parameterisation is a natural default and each model estimated by nl is sui generis.
A helpful feature is that users can choose the names they find natural for the parameters, within the constraints of what counts as a legal name in Stata, say alpha, beta, gamma, a, b, c, etc. If you choose _cons for the intercept in nl that is a legal name but otherwise not special and just your choice; nl won't take it as a signal that it should flip into using regress conventions.
The syntax you cite is part of what was made possible by a major redesign of nl but it is consistent with the original philosophy.
That the syntax is different because it needs to be may not be the answer you seek, but I guess you'll get a fuller answer only from StataCorp; developers do hang out on Statalist, but they don't make themselves visible here.

Vowpal Wabbit Continuing Training

Continuing with some experimentation, I was interested in seeing how to continue training a VW model.
I first ran this and saved the model.
vw -d housing.vm --loss_function squared -f housing2.mod --invert_hash readable.housing2.mod
Examining the readable model:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.020412
^B:158346:0.007608
^CHAS:102153:1.014402
^CRIM:141890:0.016158
^DIS:182658:0.278865
^INDUS:125597:0.062041
^LSTAT:170288:0.028373
^NOX:165794:2.872270
^PTRATIO:223085:0.108966
^RAD:232476:0.074916
^RM:2580:0.330865
^TAX:108300:0.002732
^ZN:54950:0.020350
Constant:116060:2.728616
If I then continue to train the model using two more examples (in housing_2.vm) which, note, have zero values for ZN and CHAS:
27.50 | CRIM:0.14866 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7270 AGE:79.90 DIS:2.7778 RAD:5 TAX:384.0 PTRATIO:20.90 B:394.76 LSTAT:9.42
26.50 | CRIM:0.11432 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7810 AGE:71.30 DIS:2.8561 RAD:5 TAX:384.0 PTRATIO:20.90 B:395.58 LSTAT:7.67
If the model saved is loaded and training continues, the coefficients appear to be lost from these zero valued features. Am I doing something wrong or is this a bug?
vw -d housing_2.vm --loss_function squared -i housing2.mod --invert_hash readable.housing3.mod
output from readable.housing3.mod:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.023086
^B:158346:0.008148
^CRIM:141890:1.400201
^DIS:182658:0.348675
^INDUS:125597:0.087712
^LSTAT:170288:0.050539
^NOX:165794:3.294814
^PTRATIO:223085:0.119479
^RAD:232476:0.118868
^RM:2580:0.360698
^TAX:108300:0.003304
Constant:116060:2.948345
If you want to continue learning from saved state in a smooth fashion you must use the --save_resume option.
There are 3 fundamentally different types of "state" that can be saved into a vw "model" file:
The weight vector (regressor) obviously. That's the model itself.
invariant parameters like the version of vw (to ensure binary compatibility which is not always preserved between versions), number of bits in the vector (-b), and type of model
state which dynamically changes during learning. This subset includes parameters like learning and decay rates which gradually change during learning with each example, the example numbers themselves, etc.
Only --save_resume saves the last group.
--save_resume is not the default because it has an overhead and in most use-cases it isn't needed. e.g. if you save a model once in order to do many predictions and no learning (-t), there's no need to save the 3rd subset of state.
So, I believe in your particular case, you want to use --save_resume.
The possibility of a bug always exists, especially since vw supports so many options (about 100 at last count) which are often interdependent. Some option combinations make sense, others don't. Doing a sanity check for roughly 2^100 possible option combinations is a bit unrealistic. If you find a bug, please open an issue on github. In this case, please make sure to include a complete example (full data & command line) so your problem can be reproduced.
Update 2014-09-20 (after an issue was opened on github, thanks!):
The reason for 0-valued features "disappearing" (not really from the model, but only from the --invert_hash output) is that 1) --invert_hash was never designed for multiple passes, because keeping the original feature names in a hash-table incurs a large performance overhead; 2) the missing features are those with a zero value, which are discarded. The model itself should still contain any feature that had a non-zero weight in a prior pass. Fixing this inconsistency is too complex and costly for implementation reasons, and would go against the overriding motivation of making vw fast, especially for the most useful/common use-cases. Anyway, thanks for the report, I too learned something new from it.

Mathematica- Assumptions within Simplify[]

I am using the ratio between two error probabilities in various functions. I want Mathematica to display this ratio in the most simple manner. How do I let Mathematica know that, in this case, the simplest manner is as the top line in the picture below?
(1-e1)/(1-e2) // TraditionalForm
Be aware this is strictly for output formatting, so for example if you do
x = TraditionalForm[(1-e1)/(1-e2)]  (* prints nicely *)
x === (1-e1)/(1-e2)  (* evaluates to False *)
That said, as a general principle you'll get much more done if you quit losing sleep over Mathematica's sometimes unusual formatting.

Ordinary Least Squares Regression in Vowpal Wabbit

Has anyone managed to run an ordinary least squares regression in Vowpal Wabbit? I'm trying to confirm that it will return the same answer as the exact solution, i.e. when choosing a to minimize ||y - Xa||_2^2 + ||Ra||_2^2 (where R is the regularization) I want to get the analytic answer
a = (X^T X + R^T R)^(-1) X^T y. Doing this type of regression takes about 5 lines in numpy python.
The documentation of VW suggests that it can do this (presumably with the "squared" loss function), but so far I've been unable to get it to come even close to matching the python results. Because squared is the default loss function, I'm simply calling:
$ vw-varinfo input.txt
where input.txt has lines like
1.4 | 0:3.4 1:-1.2 2:4.0 .... etc
Do I need some other parameters in the VW call? I'm unable to grok the (rather minimal) documentation.
I think you should use this syntax (vowpal wabbit version 7.3.1):
vw -d input.txt -f linear_model -c --passes 50 --holdout_off --loss_function squared --invert_hash model_readable.txt
This syntax will instruct VW to read your input.txt file, write on disk a model record and a cache (necessary for multi-pass convergence) and fit a regression using the squared loss function. Moreover it will finally write the model coefficients in a readable fashion into a file called model_readable.txt.
The --holdout_off option is a recent addition that suppresses the automatic out-of-sample loss computation (if you are using an earlier version you have to remove it).
Basically, a regression based on stochastic gradient descent will provide you with a vector of coefficients similar to the exact solution only when no regularization is applied and when the number of passes is high (I would suggest 50 or even more; randomly shuffling the input file rows would also help the algorithm converge).
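For comparison, the exact solution the question refers to ("about 5 lines in numpy") can be sketched as follows. The data here is made up for illustration; with R = 0 this reduces to plain OLS, which is what VW's squared loss should approach after many passes.

```python
import numpy as np

# Synthetic data with a known coefficient vector (no noise, for clarity).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_a = np.array([1.5, -2.0, 0.5])
y = X @ true_a
R = 0.0 * np.eye(3)  # no regularization -> ordinary least squares

# Analytic solution: a = (X^T X + R^T R)^(-1) X^T y
a = np.linalg.solve(X.T @ X + R.T @ R, X.T @ y)
print(a)  # close to [1.5, -2.0, 0.5]
```

VW's coefficients (readable via --invert_hash) won't match this to machine precision, since SGD only converges toward the exact solution, but after enough passes they should be close.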

Computationally simple pseudo-Gaussian distribution with varying mean and standard deviation?

This picture from Wikipedia has a nice example of the sort of functions I'd ideally like to generate:
Right now I'm using the Irwin-Hall distribution, which is more or less a polynomial approximation of the Gaussian distribution... basically, you use a uniform random number generator, iterate it x times, and take the average. The more iterations, the more it resembles a Gaussian distribution.
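The averaging scheme described above can be sketched in a few lines. This is a minimal illustration, with the sample count and the 0-10 range taken from the question's example; rescaling the average of n uniforms maps its mean to the midpoint of [lo, hi].

```python
import random

def pseudo_gaussian(n=12, lo=0.0, hi=10.0):
    """Irwin-Hall-style sample: average of n U(0,1) draws, rescaled to [lo, hi].

    The average of n uniforms concentrates around 0.5 and looks increasingly
    Gaussian as n grows, so the rescaled value is bell-shaped around the
    midpoint of the range.
    """
    u = sum(random.random() for _ in range(n)) / n
    return lo + (hi - lo) * u

random.seed(1)
samples = [pseudo_gaussian() for _ in range(10000)]
mean = sum(samples) / len(samples)  # close to 5.0, the midpoint of 0-10
```

Note that this construction always centers on the midpoint of the range, which is exactly the limitation the question runs into when asking for a mean of 7 on 0-10.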
It's pretty nice; however I'd like to be able to have one where I can vary the mean. For example, let's say I wanted a number between the range 0 and 10, but around 7. Like, the mean (if I repeated this function multiple times) would turn out to be 7, but the actual range is 0-10.
Is there one I should look up, or should I work on doing some fancy maths with standard Gaussian distributions?
I see a contradiction in your question. On the one hand you want a normal distribution, which is symmetrical by its nature; on the other hand you want the range asymmetrically disposed around the mean value.
I suspect you should look at other distributions whose density functions are bell-shaped but asymmetrical, such as the log-normal or beta distribution.
Look into generating normal random variates. You can generate pairs of normal random variates X = N(0,1) and transform them into ANY normal random variate Y = N(m,s) via Y = m + s*X.
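The affine transform mentioned above is one line once you have standard normal draws. A quick empirical check (the target mean 7 and sd 1.5 below are arbitrary illustration values):

```python
import random

random.seed(0)

# Standard normal draws X ~ N(0, 1)...
xs = [random.gauss(0.0, 1.0) for _ in range(100000)]

# ...shifted and scaled into Y = m + s*X ~ N(m, s).
m, s = 7.0, 1.5
ys = [m + s * x for x in xs]

mean = sum(ys) / len(ys)                                    # ~7.0
var = sum((y - mean) ** 2 for y in ys) / (len(ys) - 1)      # ~1.5**2
```

This gives you any mean you like, but does not by itself bound the output to a range like 0-10; for that, see the truncated-normal suggestion below.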
Sounds like the Truncated Normal distribution is just what the doctor ordered. It is not "computationally simple" per se, but easy to implement if you have an existing implementation of a normal distribution.
You can just generate the distribution with the mean you want, standard deviation you want, and the two ends wherever you want. You'll have to do some work beforehand to compute the mean and standard deviation of the underlying (non-truncated) normal distribution to get the mean for the TN that you want, but you can use the formulae in that article. Also note that you can adjust the variance as well using this method :)
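One simple (if not maximally efficient) way to implement the truncated normal described above is rejection sampling: draw from the underlying N(mean, sd) and discard anything outside [lo, hi]. The parameters below (mean 7, sd 2, range 0-10) mirror the question's example; note the resulting sample mean sits slightly below 7, because the upper bound cuts off more mass than the lower one, which is exactly the adjustment the answer says you must compute beforehand.

```python
import random

def truncated_normal(mean, sd, lo, hi):
    """Sample N(mean, sd) truncated to [lo, hi] by rejection.

    Fine when [lo, hi] covers most of the distribution's mass; for narrow
    or far-off-center intervals an inverse-CDF method is preferable.
    """
    while True:
        x = random.gauss(mean, sd)
        if lo <= x <= hi:
            return x

random.seed(2)
samples = [truncated_normal(7.0, 2.0, 0.0, 10.0) for _ in range(10000)]
mean = sum(samples) / len(samples)  # a bit below 7 due to the asymmetric cut
```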
I have Java code (based on the Commons Math framework) for both an accurate (slower) and quick (less accurate) implementation of this distribution, with PDF, CDF, and sampling.
