Adaptive Updates In Vowpal Wabbit Formula

I am looking at the following 2 presentations about the updates done by VW when the --adaptive flag is used. It seems these are different.
http://www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit
https://github.com/JohnLangford/vowpal_wabbit/wiki/v6.1_tutorial.pdf
With these two descriptions (respectively):
#1
#2
My questions:
Which of these are correct (or are they the same)?
For number 1 it appears that the gradient from the t+1 example is used in the denominator. How is this done? Does this mean that the new weight (labeled w_i) is the weight for example t+1?

As you noticed, the first presentation contains an error/typo in the AdaGrad formula. The formula should be w_{i, t+1} := w_{i, t} - (\eta * g_{i, t} / \sqrt{sum}), where sum=\sum_{t'=1}^t g_{i, t'}^2.
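For concreteness, here is a minimal per-coordinate sketch of that plain AdaGrad update for squared loss (illustrative C only, not VW's actual code; it ignores the extra machinery discussed below). Note that it uses the current gradient g_{i,t}, not the one from example t+1:
/* Minimal sketch of the plain per-coordinate AdaGrad update for squared loss.
   Illustrative only; VW layers --normalized and --invariant on top of this. */
#include <math.h>

void adagrad_update(double w[], double g_sq_sum[], const double x[],
                    double y, double eta, int n) {
    double pred = 0.0;
    for (int i = 0; i < n; i++)
        pred += w[i] * x[i];
    double err = pred - y;                        /* dL/dpred for 0.5*(pred-y)^2 */
    for (int i = 0; i < n; i++) {
        double g = err * x[i];                    /* g_{i,t} */
        g_sq_sum[i] += g * g;                     /* \sum_{t'=1}^t g_{i,t'}^2 */
        if (g_sq_sum[i] > 0.0)
            w[i] -= eta * g / sqrt(g_sq_sum[i]);  /* uses g_{i,t}, not g_{i,t+1} */
    }
}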
In VowpalWabbit, --adaptive (corresponding to the AdaGrad idea) is on by default. But --normalized and --invariant are also on by default, which means that on top of plain AdaGrad a few more tricks/improvements are applied. The interaction of all these tricks is complex and there is no single slide which describes all the aspects, so the only reference is the source code (gd.cc).
Which of these are correct (or are they the same)?
I think they are not the same, but they are different "layers" of the complex code. I think that slide 33 (which you cite as #2) of the second presentation corresponds to slide 31 (which you don't cite) of the first presentation, but I am not sure.

Related

How does choosing between pre and post zero padding of sequences impact results

I'm working on an NLP sequence labelling problem. My data consists of variable length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).
I intend to solve the problem using Recurrent Neural Networks. As the sequences are of variable length, I need to pad them (I want batch size > 1). I have the option of either pre-zero-padding them or post-zero-padding them, i.e. I either make every sequence (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0) such that the length of each sequence is the same.
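Concretely, with made-up token ids and 0 as the padding value, the two options look like this:
/* Toy illustration (values made up): a length-3 sequence padded to length 6. */
int pre_padded[6]  = { 0,  0,  0, 11, 42,  7};   /* (0, 0, 0, w_1, w_2, w_3) */
int post_padded[6] = {11, 42,  7,  0,  0,  0};   /* (w_1, w_2, w_3, 0, 0, 0) */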
How does the choice between pre- and post-padding impact results?
It seems like pre-padding is more common, but I can't find an explanation of why it would be better. Due to the nature of RNNs it feels like an arbitrary choice to me, since they share weights across time steps.
Commonly in RNNs, we take the final output or hidden state and use this to make a prediction (or do whatever task we are trying to do).
If we send a bunch of 0's to the RNN before taking the final output (i.e. post-padding, as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after this word.
So intuitively, this might be why pre-padding is more popular/effective.
This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding types on LSTM and CNN. They found that post-padding achieved substantially lower accuracy (nearly half) compared to pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).
A simple/intuitive explanation for RNNs is that post-padding seems to add noise to what has been learned from the sequence through time, and there are no further timesteps for the RNN to recover from this noise. With pre-padding, however, the RNN is better able to adjust to the added noise of zeros at the beginning as it learns from the sequence through time.
I think more thorough experiments are needed in the community for more detailed mechanistic explanations on how padding affects performance.
I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.

Vowpal Wabbit Continuing Training

Continuing with some experimentation here, I was interested in seeing how to continue training a VW model.
I first ran this and saved the model.
vw -d housing.vm --loss_function squared -f housing2.mod --invert_hash readable.housing2.mod
Examining the readable model:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.020412
^B:158346:0.007608
^CHAS:102153:1.014402
^CRIM:141890:0.016158
^DIS:182658:0.278865
^INDUS:125597:0.062041
^LSTAT:170288:0.028373
^NOX:165794:2.872270
^PTRATIO:223085:0.108966
^RAD:232476:0.074916
^RM:2580:0.330865
^TAX:108300:0.002732
^ZN:54950:0.020350
Constant:116060:2.728616
If I then continue to train the model using two more examples (in housing_2.vm), which, note, have zero values for ZN and CHAS:
27.50 | CRIM:0.14866 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7270 AGE:79.90 DIS:2.7778 RAD:5 TAX:384.0 PTRATIO:20.90 B:394.76 LSTAT:9.42
26.50 | CRIM:0.11432 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7810 AGE:71.30 DIS:2.8561 RAD:5 TAX:384.0 PTRATIO:20.90 B:395.58 LSTAT:7.67
If the saved model is loaded and training continues, the coefficients for these zero-valued features appear to be lost. Am I doing something wrong or is this a bug?
vw -d housing_2.vm --loss_function squared -i housing2.mod --invert_hash readable.housing3.mod
output from readable.housing3.mod:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.023086
^B:158346:0.008148
^CRIM:141890:1.400201
^DIS:182658:0.348675
^INDUS:125597:0.087712
^LSTAT:170288:0.050539
^NOX:165794:3.294814
^PTRATIO:223085:0.119479
^RAD:232476:0.118868
^RM:2580:0.360698
^TAX:108300:0.003304
Constant:116060:2.948345
If you want to continue learning from saved state in a smooth fashion you must use the --save_resume option.
There are 3 fundamentally different types of "state" that can be saved into a vw "model" file:
1. The weight vector (regressor), obviously. That's the model itself.
2. Invariant parameters like the version of vw (to ensure binary compatibility, which is not always preserved between versions), the number of bits in the vector (-b), and the type of model.
3. State which dynamically changes during learning. This subset includes parameters like learning and decay rates which gradually change during learning with each example, the example numbers themselves, etc.
Only --save_resume saves the last group.
--save_resume is not the default because it has an overhead and in most use cases it isn't needed. E.g. if you save a model once in order to do many predictions and no learning (-t), there's no need to save the 3rd subset of state.
So, I believe in your particular case, you want to use --save_resume.
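For example, reusing the commands from the question, a resumable workflow would look roughly like this (the first run saves the extra state, the second loads it and keeps learning):
vw -d housing.vm --loss_function squared -f housing2.mod --save_resume --invert_hash readable.housing2.mod
vw -d housing_2.vm --loss_function squared -i housing2.mod -f housing3.mod --save_resume --invert_hash readable.housing3.mod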
The possibility of a bug always exists, especially since vw supports so many options (about 100 at last count) which are often interdependent. Some option combinations make sense, others don't. Doing a sanity check for roughly 2^100 possible option combinations is a bit unrealistic. If you find a bug, please open an issue on github. In this case, please make sure to use a complete example (full data & command line) so your problem can be reproduced.
Update 2014-09-20 (after an issue was opened on github, thanks!):
The reason for 0-valued features "disappearing" (not really from the model, but only from the --invert_hash output) is that 1) --invert_hash was never designed for multiple passes, because keeping the original feature names in a hash table incurs a large performance overhead, and 2) the missing features are those with a zero value, which are discarded. The model itself should still contain any feature that had a non-zero weight in a prior pass. Fixing this inconsistency is too complex and costly for implementation reasons, and would go against the overriding motivation of making vw fast, especially for the most useful/common use cases. Anyway, thanks for the report, I too learned something new from it.

Ising 2D Optimization

I have implemented a MC-Simulation of the 2D Ising model in C99.
Compiling with gcc 4.8.2 on Scientific Linux 6.5.
When I scale up the grid the simulation time increases, as expected.
The implementation simply uses the Metropolis–Hastings algorithm.
I tried to find a way to speed up the algorithm, but I don't have any good ideas.
Are there any tricks for doing so?
As jimifiki wrote, try to do a profiling session.
In order to improve on the algorithmic side only, you could try the following:
Lookup Table:
When calculating the energy difference for the Metropolis criterion you need to evaluate the exponential exp[-K / T * dE], where K is your scaling constant (in units of Boltzmann's constant) and dE is the energy difference between the original state and the one after a spin-flip.
Calculating exponentials is expensive, so you simply build a table beforehand in which to look up the possible values of dE. There will be (4 choose 1) + (4 choose 2) + (4 choose 3) + (4 choose 4) possible combinations for a nearest-neighbour interaction; exploit the problem's symmetry and you get five values for dE: 8, 4, 0, -4, -8. Instead of using the exp function, use the precalculated table.
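A minimal sketch of such a table in C99 (assuming ±1 spins, so dE takes only the five values above; K and T as in the formula):
/* Precompute the Boltzmann factors for the five possible dE values. */
#include <math.h>

static double boltzmann[5];            /* index k corresponds to dE = 4*(k - 2) */

void build_table(double K, double T) {
    for (int k = 0; k < 5; k++)
        boltzmann[k] = exp(-K / T * 4.0 * (k - 2));
}

/* Metropolis acceptance test; dE must be one of -8, -4, 0, 4, 8,
   and r is a uniform random number in [0, 1). */
int accept_flip(int dE, double r) {
    return dE <= 0 || r < boltzmann[dE / 4 + 2];
}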
Parallelization:
As mentioned before, it is possible to parallelize the algorithm. To preserve physical correctness, you have to use a so-called checkerboard scheme: consider the two-dimensional grid as a checkerboard and update all the white cells in parallel first, then all the black ones. That should be clear when you consider the nearest-neighbour interaction, which introduces dependencies between the values.
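As a rough sketch (hypothetical L×L grid of ±1 spins; flip_site() is only a placeholder for your actual Metropolis update), an OpenMP version of one sweep could look like this:
/* Checkerboard sweep: update all "white" sites (i + j even) in parallel,
   then all "black" sites. Compile with -fopenmp. */
#include <omp.h>

#define L 64
static int spin[L][L];

static void flip_site(int i, int j) {
    spin[i][j] = -spin[i][j];           /* placeholder for the Metropolis test */
}

void checkerboard_sweep(void) {
    for (int colour = 0; colour < 2; colour++) {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < L; i++)
            for (int j = 0; j < L; j++)
                if ((i + j) % 2 == colour)
                    flip_site(i, j);
    }
}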
Use GPGPU:
You can also implement the simulation on a GPGPU, e.g. using CUDA, since you're already working in C99.
Some tips:
- Don't forget to align your C99 structs properly.
- Use linear arrays, not nested ones. Aligned memory is normally faster to access, if done properly.
- Try to let the compiler do loop unrolling, etc. (gcc has special options for this; they are not enabled by default at -O2).
Some more information:
If you are looking for an efficient method to calculate the critical point of the system, the method of choice is finite-size scaling, where you simulate at different system sizes and different temperatures and then calculate a quantity which is system-size independent at the critical point, and therefore gives an intersection point of the corresponding curves (please see the theory for a detailed explanation).
I hope I was helpful.
Cheers...
It's normal that your simulation time scales at least with the square of the grid size, isn't it?
Here are some suggestions:
If you are concerned with thermalization issues, try to use parallel tempering. It can be of help.
The Metropolis-Hastings algorithm can be made parallel. You could try to do it.
Check you are not pessimizing the code.
Are your spin arrays of ints? You could put many spins on the same int. It's a lot of work.
Moreover, remember what Donald Knuth taught us:
premature optimisation is the root of all evil
Before optimising you should first understand where your program is slow. This is called profiling.

2D PHP Pattern Creator Algorithm

I'm looking for an algorithm to help me build 2D patterns based on rules. The idea is that I could write a script using a given set of parameters, and it would return a random, 2-dimensional sequence up to a given length.
My plan is to use this to generate image patterns based on rules. Things like image fractals or sprites for game levels could possibly use this.
For example, let's say that you can use A, B, C, & D to create the pattern. The rule is that C and A can never be next to each other, and that D always follows C. Next, let's say I want a pattern of size 4x4. The result might be the following, which respects all the rules.
A B C D
B B B B
C D B B
C D C D
Are there any existing libraries that can do calculations like this? Are there any mathematical formulas I can read-up on?
While pretty inefficient concerning runtime, backtracking is an often-used algorithm for this kind of problem.
It follows a simple pattern, and if written correctly, you can easily plug a new rule set into it.
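As an illustration only (the two rules from the question are hard-coded; adjacency is taken to mean left/up neighbours, and "D always follows C" is read left to right within a row, matching the sample grid), a backtracking fill could look like this:
/* Backtracking sketch for the 4x4 example. Interpretation of the rules:
   - 'C' and 'A' may never be orthogonally adjacent;
   - every 'C' must have a 'D' immediately to its right, and every 'D' a 'C'
     immediately to its left (matches the sample grid in the question).
   Cells are filled row-major; a choice is undone as soon as a rule fails. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4
static char grid[N][N];

static int allowed(int r, int c) {
    char cur  = grid[r][c];
    char left = (c > 0) ? grid[r][c - 1] : 0;
    char up   = (r > 0) ? grid[r - 1][c] : 0;
    if ((cur == 'A' && (left == 'C' || up == 'C')) ||
        (cur == 'C' && (left == 'A' || up == 'A')))
        return 0;                                /* C next to A */
    if (left == 'C' && cur != 'D') return 0;     /* C not followed by D */
    if (cur == 'D' && left != 'C') return 0;     /* D not preceded by C */
    if (cur == 'C' && c == N - 1)  return 0;     /* C with no room for a D */
    return 1;
}

static int fill(int pos) {
    if (pos == N * N) return 1;                  /* grid complete */
    int r = pos / N, c = pos % N;
    int off = rand() % 4;                        /* vary the generated pattern */
    for (int k = 0; k < 4; k++) {
        grid[r][c] = "ABCD"[(k + off) % 4];
        if (allowed(r, c) && fill(pos + 1)) return 1;
    }
    grid[r][c] = 0;                              /* backtrack */
    return 0;
}

int main(void) {
    srand((unsigned)time(NULL));
    if (fill(0))
        for (int r = 0; r < N; r++)
            printf("%c %c %c %c\n", grid[r][0], grid[r][1], grid[r][2], grid[r][3]);
    return 0;
}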
Define your rule data structures; i.e., define the set of operations that the rules can encapsulate, and define the available cross-referencing that can be done. Once you've done this, you should have a clearer view of what type of algorithms to use to apply these rules to a potential result set.
Supposing that your rules are restricted to "type X is allowed to have type Y immediately to its left/right/top/bottom", you potentially have situations where generating possible patterns is computationally difficult. Take a look at Wang Tiles (a good source is the book Tilings and Patterns by Grunbaum and Shephard) and you'll see that with the kinds of rule sets you might state, you can define sets of Wang Tiles. Appropriate sets of these are Turing complete.
For small rectangles, or your sets of rules, this may only be of academic interest. As mentioned elsewhere a backtracking approach might be appropriate for your ruleset - in which case you may want to consider appropriate heuristics for the order in which new components are added to your grid. Again, depending on your rulesets, other approaches might work. E.g. if your ruleset admits many solutions you might get a long way by randomly allocating many items to the grid before attempting to fill in remaining gaps.

intelligent path truncation/ellipsis for display

I am looking for an existing path truncation algorithm (similar to what the Win32 static control does with SS_PATHELLIPSIS) for a set of paths, one that focuses on the distinct elements.
For example, if my paths are like this:
Unit with X/Test 3V/
Unit with X/Test 4V/
Unit with X/Test 5V/
Unit without X/Test 3V/
Unit without X/Test 6V/
Unit without X/2nd Test 6V/
When not enough display space is available, they should be truncated to something like this:
...with X/...3V/
...with X/...4V/
...with X/...5V/
...without X/...3V/
...without X/...6V/
...without X/2nd ...6V/
(Assuming that an ellipsis generally is shorter than three letters).
This is just an example of a rather simple, ideal case (e.g. they'd all end up at different lengths now, and I wouldn't know how to create a good suggestion when a path "Thingie/Long Test/" is added to the pool).
There is no given structure to the path elements; they are assigned by the user, but items will often have similar segments. It should work for proportional fonts, so the algorithm should take a measure function (and not call it too heavily) or generate a list of suggestions.
Data-wise, a typical use case would contain 2..4 path segments and 20 elements per segment.
I am looking for previous attempts in this direction, and whether that's solvable with a sensible amount of code or dependencies.
I'm assuming you're asking mainly about how to deal with the set of folder names extracted from the same level of hierarchy, since splitting by rows and path separators and aggregating by hierarchy depth is simple.
Your problem reminds me a lot of the longest common substring problem, with the differences that:
You're interested in many substrings, not just one.
You care about order.
These may appear substantial, but if you examine the dynamic-programming solution in the article you can see that it revolves around creating a table of "character collisions" and then looking for the longest diagonal in this table. I think that you could instead enumerate all diagonals in the table by the order in which they appear, and then for each path replace, by order, all appearances of these strings with ellipses.
Enforcing a minimal substring length of 2 will return a result similar to what you've outlined in your question.
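For two of the example segments, a sketch of that dynamic-programming table (rolling one row at a time; names are illustrative) looks like this:
/* Longest common substring of two strings via the classic DP table;
   the diagonals of the table are the "character collisions" mentioned above. */
#include <stdio.h>
#include <string.h>

void longest_common_substring(const char *a, const char *b,
                              size_t *best_len, size_t *best_end_a) {
    size_t la = strlen(a), lb = strlen(b);
    size_t dp[256] = {0}, prev[256] = {0};        /* assumes strlen(b) < 256 */
    *best_len = 0;
    *best_end_a = 0;
    for (size_t i = 0; i < la; i++) {
        for (size_t j = 0; j < lb; j++) {
            dp[j] = (a[i] == b[j]) ? (j > 0 ? prev[j - 1] : 0) + 1 : 0;
            if (dp[j] > *best_len) { *best_len = dp[j]; *best_end_a = i + 1; }
        }
        memcpy(prev, dp, sizeof dp);
    }
}

int main(void) {
    size_t len, end;
    const char *a = "Unit with X/Test 3V/";
    longest_common_substring(a, "Unit without X/Test 6V/", &len, &end);
    printf("longest common substring: \"%.*s\"\n", (int)len, a + end - len);
    return 0;
}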
It does seem like it requires some tinkering with the algorithm (for example, ensuring a certain substring is first in all strings), and then you need to invoke it over your entire set... I hope this at least gives you a possible direction.
Well, the "natural number" ordering part is actually easy, simply replace all numbers with formatted number where there is enough leading zeroes, eg. Test 9V -> Test 000009V and Test 12B -> Test 000012B. These are now sortable by standard methods.
As for the actual ellipsizing: unless this is actually a huge system, I'd just add a manual ellipsizing "list" (of regexes, for flexibility and pain) that turns certain words into ellipses. This does require continuous work, but coming up with the algorithm eats your time too; there are myriads of corner cases.
I'd probably try a "flood fill" approach. Arrange the first level of directories as you would a bitmap, where every letter is a pixel. Iterate over all characters that appear in the directory names. For each of them, "paint" that same character, then "paint" the next character from the first string such that it follows the previous character (and so on). Then select the longest painted string that you find.
Example (if prefixed with *, it's painted)
Foo
BarFoo
*Foo
Bar*Foo
*F*oo
Bar*F*oo
...
note that:
*ofoo
b*oo
*o*foo
b*oo
.. painting of the first 'o' stops since there are no continuing characters.
of*oo
b*oo
...
And then you get to the second "o" and it will find a substring of at least 2.
So you will have to iterate over most possible character instances (one optimization is to stop in each string at position Length - n, where n is the length of the longest common substring found so far). But then there is yet another problem (here with "Beta Beta"):
| <- visibility cutout
Alfa Beta Gamma Delta 1
Alfa Beta Gamma Delta 2
Alfa Beta Beta 1
Alfa Beta Beta 2
Beta Beta 1
Beta Beta 2
Beta Beta 3
Beta Beta 4
What do you want to do? Cut Alfa Beta Gamma Delta or Alfa Beta or Beta Beta or Beta?
This is a bit rambling, but might be entertaining :).
