Ordinary Least Squares Regression in Vowpal Wabbit

Has anyone managed to run an ordinary least squares regression in Vowpal Wabbit? I'm trying to confirm that it will return the same answer as the exact solution, i.e. when choosing a to minimize ||y - X a||_2^2 + ||R a||_2^2 (where R is the regularization) I want to get the analytic answer
a = (X^T X + R^T R)^(-1) X^T y. Doing this type of regression takes about 5 lines of Python with numpy.
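For reference, the closed-form solution mentioned above can be sketched in numpy roughly as follows. The data, the names (`ridge_closed_form`, `a_true`) and the choice R = 0 are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def ridge_closed_form(X, y, R):
    """Solve (X^T X + R^T R) a = X^T y for a, without forming an explicit inverse."""
    normal_matrix = X.T @ X + R.T @ R
    return np.linalg.solve(normal_matrix, X.T @ y)

# Synthetic data (hypothetical): 200 rows, 3 features, small additive noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
a_true = np.array([1.5, -2.0, 0.5])
y = X @ a_true + 0.01 * rng.normal(size=200)

R = np.zeros((3, 3))  # R = 0 means no regularization, i.e. plain OLS
a_hat = ridge_closed_form(X, y, R)
print(a_hat)
```

With R set to zero this should agree with `np.linalg.lstsq(X, y, rcond=None)[0]`; a ridge penalty of strength lam corresponds to `R = np.sqrt(lam) * np.eye(3)`.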
The documentation of VW suggests that it can do this (presumably with the "squared" loss function), but so far I've been unable to get it to come even close to matching the Python results. Because squared is the default loss function, I'm simply calling:
$ vw-varinfo input.txt
where input.txt has lines like
1.4 | 0:3.4 1:-1.2 2:4.0 .... etc
Do I need some other parameters in the VW call? I'm unable to grok the (rather minimal) documentation.
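A minimal sketch of producing input lines in the format shown in the question, assuming features are named by their column index. The helper name `to_vw_line` is hypothetical.

```python
def to_vw_line(label, features):
    """Format one example as 'label | 0:v0 1:v1 ...' with index-named features."""
    feats = " ".join(f"{index}:{value}" for index, value in enumerate(features))
    return f"{label} | {feats}"

line = to_vw_line(1.4, [3.4, -1.2, 4.0])
print(line)  # 1.4 | 0:3.4 1:-1.2 2:4.0
```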

I think you should use this syntax (vowpal wabbit version 7.3.1):
vw -d input.txt -f linear_model -c --passes 50 --holdout_off --loss_function squared --invert_hash model_readable.txt
This syntax will instruct VW to read your input.txt file, write on disk a model record and a cache (necessary for multi-pass convergence) and fit a regression using the squared loss function. Moreover it will finally write the model coefficients in a readable fashion into a file called model_readable.txt.
The --holdout_off option is a recent addition that suppresses the automatic out-of-sample loss computation (if you are using an earlier version, you have to remove it).
Basically a regression analysis based on stochastic gradient descent will provide you with a vector of coefficients similar to the exact solution only when no regularization is applied and when the number of passes is high (I would suggest 50 or even more, also randomly shuffling the input file rows would help the algorithm to converge better).
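To illustrate why the number of passes matters, here is a small pure-Python sketch of plain stochastic gradient descent on the squared loss. It is not VW's actual update rule (VW's defaults use adaptive, normalized, invariant updates); the data, learning rate and function names are assumptions chosen for illustration. The targets are noise-free, so the exact least squares solution equals `true_coef`.

```python
import random

def sgd_least_squares(rows, targets, passes, lr=0.01, seed=0):
    """Plain SGD on squared loss, reshuffling the examples before every pass."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(passes):
        rng.shuffle(order)
        for i in order:
            prediction = sum(w * v for w, v in zip(weights, rows[i]))
            error = targets[i] - prediction
            weights = [w + lr * error * v for w, v in zip(weights, rows[i])]
    return weights

# Hypothetical noise-free data: the exact solution is true_coef.
data_rng = random.Random(1)
true_coef = [1.5, -2.0, 0.5]
rows = [[data_rng.gauss(0, 1) for _ in true_coef] for _ in range(200)]
targets = [sum(c * v for c, v in zip(true_coef, row)) for row in rows]

def max_abs_error(weights):
    return max(abs(w - c) for w, c in zip(weights, true_coef))

err_1 = max_abs_error(sgd_least_squares(rows, targets, passes=1))
err_50 = max_abs_error(sgd_least_squares(rows, targets, passes=50))
print(err_1, err_50)
```

In this setup a single pass leaves a visible gap to the exact coefficients, while 50 shuffled passes close it almost entirely.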

Related

Is there a way to use torch.autograd.grad in parallel in PyTorch?

I am trying to train a network where the loss is a function not only of the output but also of the derivative of the output w.r.t. the input. The problem is that while the batch output can be computed in parallel with PyTorch modules, I can't find a way to compute the derivative in parallel. Here's the best I can do in serial:
import torch
x=torch.rand(300,1)
dydx=torch.zeros_like(x)
fc=torch.nn.Linear(1,1)
x.requires_grad=True
for ii in range(x.size(0)):
    xi=x[ii,0:]
    yi=torch.tanh(fc(xi))
    dydx[ii]=torch.autograd.grad(yi,xi,create_graph=True)[0]
dydxsum=(dydx**2).sum()
dydxsum.backward()
In the code above, x is split to save memory and time. However, when x becomes large, parallelization (in CUDA) is still necessary. If this has to be implemented by modifying PyTorch itself, a hint on where to start would be appreciated.
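One possible way to avoid the Python loop, sketched under the assumption that each output row depends only on its own input row (true for the Linear + tanh model in the question): differentiating the sum of the outputs with respect to the whole batch yields every per-sample derivative in a single call. This is a suggestion, not something confirmed in the thread.

```python
import torch

torch.manual_seed(0)
x = torch.rand(300, 1, requires_grad=True)
fc = torch.nn.Linear(1, 1)

y = torch.tanh(fc(x))  # batched forward pass, shape (300, 1)

# Row i of y depends only on row i of x, so d(sum(y))/dx[i] == dy[i]/dx[i].
(dydx,) = torch.autograd.grad(y.sum(), x, create_graph=True)

loss = (dydx ** 2).sum()
loss.backward()  # second-order backward pass through the derivative
```

For this model the result can be checked against the analytic derivative (1 - tanh^2(w x + b)) * w. The shortcut no longer holds if the network mixes information across rows of the batch (batch normalization, for example).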

How to minimize a cost function with Matlab when input variable is a large image: increase speed and prevent memory crash

I am trying to implement a differential phase integration method described in this paper:
Thüring, Thomas, et al. "Non-linear regularized phase retrieval for unidirectional X-ray differential phase contrast radiography." Optics express 19.25 (2011): 25545-25558.
Basically, it's a way to integrate a differential image across the columns only, while imposing some constraints on continuity across the rows to prevent stripe noise.
From a mathematical point of view, I want to minimize the following equation:
where ||.|| is the L2 norm, Dx is the derivative along the columns, Dy is the derivative across the rows, A is the unknown integrated matrix, lambda is a user-defined parameter and phi is the differential profile I measured. Note that for the Dy operator the L1 norm can also be used.
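The formula itself is not shown here. Judging from the definitions above and from the costFunction code in the question, the objective is presumably the following (a reconstruction, not a quotation from the paper; the squared norms are taken from the code):

```latex
\min_{A} \; \lVert D_x A - \varphi \rVert_2^2 + \lambda \, \lVert D_y A \rVert_2^2
```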
I wrote some code using fminunc as the MATLAB solver:
pdiff=imresize(diff(padarray(p,[0,1],'replicate','post'),1,2),[128,128]);
noise = 0.02 * randn(size(pdiff));
pdiff_noise = pdiff + noise ;
% normal integration
integratedProfile=cumsum(pdiff_noise,2);
options=optimoptions(@fminunc,'Display','iter-detailed','UseParallel',true,'MaxIterations',35);
% regularized integration
startingPoint=zeros(size(pdiff_noise));
fun=@(x)costFunction(pdiff_noise,x);
integratedProfile_optmized=fminunc(fun,startingPoint,options);
function difference=costFunction(ep,op)
L=0.2;
dep_o=diff(padarray(op,[0,1],'replicate','post'),1,2);
dep_v=diff(padarray(op,[1,0],'replicate','post'),1,1);
difference=sum(sum((ep-dep_o).^2))+L*sum(sum(dep_v.^2));
end
It works using a 128x128 differential image.
The problem arises as soon as I try to work with a larger image. In particular, with a 256x256 matrix each iteration takes forever, even with the parallel option, and uses almost the entire RAM.
When I move to a 512x512 matrix I get this error:
Requested 262144x262144 (512.0GB) array exceeds maximum
array size preference.
Error in fminusub (line 165)
H = eye(sizes.nVar);
Error in fminunc (line 446)
[x,FVAL,GRAD,HESSIAN,EXITFLAG,OUTPUT] =
fminusub(funfcn,x, ...
Error in Untitled (line 13)
integratedProfile_optmized=fminunc(fun,startingPoint,options);
Unfortunately, my final goal is to process approximately 3000 images of 500x500 size.
I think I have understood that the crash is related to the size of the matrix and to the fact that each pixel is a variable. Therefore, MATLAB needs to allocate a huge Hessian that doesn't fit into memory.
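The figure in the error message can be checked with a little arithmetic, assuming one optimization variable per pixel and a dense Hessian of 8-byte doubles. The helper name is hypothetical.

```python
def dense_hessian_gib(height, width, bytes_per_value=8):
    """Memory needed for a dense n-by-n Hessian, where n = height * width."""
    n_vars = height * width
    return n_vars * n_vars * bytes_per_value / 2**30

for side in (128, 256, 512):
    print(side, dense_hessian_gib(side, side))
# 128 2.0
# 256 32.0
# 512 512.0
```

The 512x512 case reproduces the 262144x262144 (512.0GB) array from the error, and the 256x256 case already needs 32 GiB.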
However, I don't really know how to solve it while also speeding up the processing.
Do you have any suggestions on how to work with large images? Is there another solver that may work in a faster way? Any mathematical approach to making the problem easier?
Thanks!

vowpal wabbit: --multilabel_oaa does not produce probabilities

I've tried
vw --multilabel_oaa 68 -d vw_data.csv --loss_function=logistic --probabilities -p probabilities.txt
and ended up with only target labels in probabilities.txt. The -r option, which is designed to produce raw output, returned nothing either.
Apart from that, I'm not sure whether there is a way to achieve similar behaviour (multilabel prediction with logistic loss) with other available VW multiclass options such as --csoaa and --wap.
I don't remember exactly, but I think --probabilities does not support multilabel. I don't even know what the interpretation would be (modelling the probability of label co-occurrence? providing the probabilities for all 2^68 subsets?).
You can use standard multi-class --oaa 68. With --probabilities it should predict the probability for each class, so you can use e.g. some kind of threshold for selecting multiple labels (classes) for each example (e.g. such that the sum of their probabilities is at least 42%).
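The thresholding idea could look roughly like this, assuming the per-class probabilities for one example have already been read into a dict. The function name and the sample numbers are made up for illustration.

```python
def select_labels(class_probs, min_total=0.42):
    """Take the most probable classes until their probabilities sum to min_total."""
    chosen, total = [], 0.0
    ranked = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)
    for label, prob in ranked:
        chosen.append(label)
        total += prob
        if total >= min_total:
            break
    return chosen

probs = {1: 0.30, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.10}
picked = select_labels(probs)
print(picked)  # [1, 2]
```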

Vowpal Wabbit Continuing Training

Continuing with some experimentation, I was interested in seeing how to continue training a VW model.
I first ran this and saved the model.
vw -d housing.vm --loss_function squared -f housing2.mod --invert_hash readable.housing2.mod
Examining the readable model:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.020412
^B:158346:0.007608
^CHAS:102153:1.014402
^CRIM:141890:0.016158
^DIS:182658:0.278865
^INDUS:125597:0.062041
^LSTAT:170288:0.028373
^NOX:165794:2.872270
^PTRATIO:223085:0.108966
^RAD:232476:0.074916
^RM:2580:0.330865
^TAX:108300:0.002732
^ZN:54950:0.020350
Constant:116060:2.728616
I then continue to train the model using two more examples (in housing_2.vm). Note that both have zero values for ZN and CHAS:
27.50 | CRIM:0.14866 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7270 AGE:79.90 DIS:2.7778 RAD:5 TAX:384.0 PTRATIO:20.90 B:394.76 LSTAT:9.42
26.50 | CRIM:0.11432 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7810 AGE:71.30 DIS:2.8561 RAD:5 TAX:384.0 PTRATIO:20.90 B:395.58 LSTAT:7.67
If the model saved is loaded and training continues, the coefficients appear to be lost from these zero valued features. Am I doing something wrong or is this a bug?
vw -d housing_2.vm --loss_function squared -i housing2.mod --invert_hash readable.housing3.mod
output from readable.housing3.mod:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.023086
^B:158346:0.008148
^CRIM:141890:1.400201
^DIS:182658:0.348675
^INDUS:125597:0.087712
^LSTAT:170288:0.050539
^NOX:165794:3.294814
^PTRATIO:223085:0.119479
^RAD:232476:0.118868
^RM:2580:0.360698
^TAX:108300:0.003304
Constant:116060:2.948345
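To see exactly which features differ between two --invert_hash dumps, the name:hash:weight lines can be parsed and compared. This is a rough sketch that assumes feature lines have exactly that three-field shape, as in the listings above; the function name and the shortened excerpts are illustrative.

```python
def parse_readable_model(text):
    """Map feature name -> weight for every 'name:hash:weight' line."""
    weights = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(":", 2)
        if len(parts) != 3 or not parts[0]:
            continue  # header lines such as 'bits:18' or ':0'
        name, hash_index, weight = parts
        try:
            int(hash_index)
            weights[name] = float(weight)
        except ValueError:
            continue
    return weights

# Shortened excerpts of the two listings from the question.
before = "options:\n:0\n^AGE:104042:0.020412\n^CHAS:102153:1.014402\n^ZN:54950:0.020350\nConstant:116060:2.728616"
after = "options:\n:0\n^AGE:104042:0.023086\nConstant:116060:2.948345"

missing = sorted(set(parse_readable_model(before)) - set(parse_readable_model(after)))
print(missing)  # ['^CHAS', '^ZN']
```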
If you want to continue learning from saved state in a smooth fashion you must use the --save_resume option.
There are 3 fundamentally different types of "state" that can be saved into a vw "model" file:
1. The weight vector (regressor), obviously. That's the model itself.
2. Invariant parameters, like the version of vw (to ensure binary compatibility, which is not always preserved between versions), the number of bits in the vector (-b), and the type of model.
3. State which dynamically changes during learning. This subset includes parameters like learning and decay rates, which gradually change during learning with each example, the example numbers themselves, etc.
Only --save_resume saves the last group.
--save_resume is not the default because it has an overhead and in most use-cases it isn't needed. E.g. if you save a model once in order to do many predictions and no learning (-t), there's no need to save the 3rd subset of state.
So, I believe in your particular case, you want to use --save_resume.
The possibility of a bug always exists, especially since vw supports so many options (about 100 at last count) which are often interdependent. Some option combinations make sense, others don't. Doing a sanity check for roughly 2^100 possible option combinations is a bit unrealistic. If you find a bug, please open an issue on github. In this case, please make sure to use a complete example (full data & command line) so your problem can be reproduced.
Update 2014-09-20 (after an issue was opened on github, thanks!):
The reason for 0 valued features "disappearing" (not really from the model, but only from the --invert_hash output) is twofold: 1) --invert_hash was never designed for multiple passes, because keeping the original feature names in a hash-table incurs a large performance overhead; 2) the missing features are those with a zero value, which are discarded. The model itself should still contain every feature that received a non-zero weight in a prior pass. Fixing this inconsistency is too complex and costly for implementation reasons, and would go against the overriding motivation of making vw fast, especially for the most useful/common use-cases. Anyway, thanks for the report, I too learned something new from it.

Computationally simple pseudo-Gaussian distribution with varying mean and standard deviation?

This picture from Wikipedia has a nice example of the sort of functions I'd ideally like to generate:
Right now I'm using the Irwin-Hall distribution, which is more or less a polynomial approximation of the Gaussian distribution: you draw from a uniform random number generator n times and take the average (strictly speaking, Irwin-Hall is the sum of the draws; the average is the Bates distribution). The more draws, the closer it gets to a Gaussian distribution.
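The averaging approach described here can be sketched as follows; the function name, the range 0 to 10 and the choice of 12 draws are assumptions for illustration.

```python
import random

def averaged_uniform(n_draws, low, high, rng):
    """Average n_draws uniform samples, then rescale to the range [low, high]."""
    mean_u = sum(rng.random() for _ in range(n_draws)) / n_draws
    return low + (high - low) * mean_u

rng = random.Random(0)
samples = [averaged_uniform(12, 0.0, 10.0, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
print(min(samples), max(samples), sample_mean)
```

The result is always symmetric around the midpoint of the range (5 here), which is exactly the limitation the question runs into.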
It's pretty nice; however I'd like to be able to have one where I can vary the mean. For example, let's say I wanted a number between the range 0 and 10, but around 7. Like, the mean (if I repeated this function multiple times) would turn out to be 7, but the actual range is 0-10.
Is there one I should look up, or should I work on doing some fancy maths with standard Gaussian distributions?
I see a contradiction in your question. On the one hand you want a normal distribution, which is symmetric by its nature; on the other hand you want a range that is placed asymmetrically around the mean.
I suspect you should look at other distributions whose density functions are bell-shaped but asymmetric, like the log distribution or the beta distribution.
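As a sketch of the beta suggestion: a beta variate lives on [0, 1] with mean alpha / (alpha + beta), so it can be rescaled to any range with any interior mean. The `concentration` parameter (alpha + beta, which controls how tight the bell is) and the function name are assumptions made here.

```python
import random

def beta_on_range(low, high, mean, concentration, rng):
    """Sample from a beta distribution rescaled to [low, high] with the given mean."""
    fraction = (mean - low) / (high - low)      # target mean on the unit interval
    alpha = fraction * concentration
    beta = (1.0 - fraction) * concentration
    return low + (high - low) * rng.betavariate(alpha, beta)

rng = random.Random(0)
samples = [beta_on_range(0.0, 10.0, 7.0, 10.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

With a range of 0 to 10 and a mean of 7, this corresponds to roughly Beta(7, 3) scaled by 10.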
Look into generating normal random variates. You can generate pairs of standard normal random variates X = N(0,1) and transform each into ANY normal random variate Y = N(m,s) via Y = m + s*X.
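A minimal sketch of that transformation, using Python's standard library generator for the standard normal draw. The values m = 7 and s = 1.5 are arbitrary.

```python
import random

def shifted_normal(m, s, rng):
    """Turn a standard normal draw X into Y = m + s * X."""
    x = rng.gauss(0.0, 1.0)
    return m + s * x

rng = random.Random(0)
samples = [shifted_normal(7.0, 1.5, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```

Note that Y is unbounded, so some samples will fall outside a range such as 0 to 10.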
Sounds like the Truncated Normal distribution is just what the doctor ordered. It is not "computationally simple" per se, but easy to implement if you have an existing implementation of a normal distribution.
You can just generate the distribution with the mean you want, standard deviation you want, and the two ends wherever you want. You'll have to do some work beforehand to compute the mean and standard deviation of the underlying (non-truncated) normal distribution to get the mean for the TN that you want, but you can use the formulae in that article. Also note that you can adjust the variance as well using this method :)
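A rough illustration of the idea, using simple rejection sampling rather than the Java implementation mentioned in this answer. The function names and the parameters (mu = 7, sigma = 2, range 0 to 10) are assumptions. The mean formula is the standard one for a truncated normal, and it shows why the underlying mu has to be adjusted: the truncated mean differs from mu.

```python
import math
import random

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_mean(mu, sigma, low, high):
    """Mean of N(mu, sigma) restricted to the interval [low, high]."""
    a, b = (low - mu) / sigma, (high - mu) / sigma
    return mu + sigma * (normal_pdf(a) - normal_pdf(b)) / (normal_cdf(b) - normal_cdf(a))

def sample_truncated_normal(mu, sigma, low, high, rng):
    """Rejection sampling: redraw until the value lands inside [low, high]."""
    while True:
        x = rng.gauss(mu, sigma)
        if low <= x <= high:
            return x

rng = random.Random(0)
samples = [sample_truncated_normal(7.0, 2.0, 0.0, 10.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
expected_mean = truncated_normal_mean(7.0, 2.0, 0.0, 10.0)
print(sample_mean, expected_mean)
```

Getting a truncated mean of exactly 7 would mean solving `truncated_normal_mean(mu, ...) == 7` for mu numerically. Rejection sampling is only efficient when most of the underlying mass lies inside the range.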
I have Java code (based on the Commons Math framework) for both an accurate (slower) and quick (less accurate) implementation of this distribution, with PDF, CDF, and sampling.
