Why does the average loss go up when training with Vowpal Wabbit?

I tried to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it showed me weird results. Dug around but didn't find anything helpful.
$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.064936 0.077711 16 16.0 -0.1800 0.0547 77
0.060507 0.056079 32 32.0 0.0000 0.3164 79
0.136933 0.213358 64 64.0 -0.5900 -0.0850 79
0.151692 0.166452 128 128.0 0.0700 0.0060 79
0.133965 0.116238 256 256.0 0.0900 -0.0446 78
0.179995 0.226024 512 512.0 0.3700 -0.0217 79
0.109296 0.038597 1024 1024.0 0.1200 -0.0728 79
0.579360 1.049425 2048 2048.0 -0.3700 -0.0084 79
0.485389 0.485389 4096 4096.0 1.9600 0.3934 79 h
0.517748 0.550036 8192 8192.0 0.0700 0.0334 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
$ wc model
41 48 657 model
Questions:
Why is the number of features in the output (readable) model less than the number of actual features? I counted 78 features in the training data (79 including the bias term, as shown during training). The number of feature bits is 24, which should be far more than enough to avoid collisions.
Why does the average loss actually go up during training, as you can see in the example above?
(Minor) I tried to increase the number of feature bits to 32, and it produced an empty model. Why?
EDIT:
I tried shuffling the input file, as well as using --holdout_off, as suggested. But the result is still almost the same - the average loss goes up.
$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.071332 0.090504 16 16.0 0.0300 0.1203 79
0.043720 0.016108 32 32.0 -0.2200 -0.1971 78
0.142895 0.242071 64 64.0 0.0100 -0.1531 79
0.158564 0.174232 128 128.0 0.0500 -0.0439 79
0.150691 0.142818 256 256.0 0.3200 0.1466 79
0.197050 0.243408 512 512.0 0.2300 -0.0459 79
0.117398 0.037747 1024 1024.0 0.0400 0.0284 79
0.636949 1.156501 2048 2048.0 1.2500 -0.0152 79
0.363364 0.089779 4096 4096.0 0.1800 0.0071 79
0.477569 0.591774 8192 8192.0 -0.4800 0.0065 79
0.411068 0.344567 16384 16384.0 0.0700 0.0450 77
finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
The training examples are unique, so I doubt there is an over-fitting problem (which, as I understand it, usually happens when the number of inputs is too small compared to the number of features).
EDIT2:
I tried printing the average loss after every pass over the examples, and it mostly remains constant.
$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.498822 0.498822 3112 3112.0 0.0800 0.0015 79 h
0.476677 0.454595 6224 6224.0 -0.2200 -0.0085 79 h
0.466413 0.445856 9336 9336.0 0.0200 -0.0022 79 h
0.490221 0.561506 12448 12448.0 0.0700 -0.1113 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
Here is another try without the --l1, --l2 and -b parameters:
$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.520286 0.520286 3112 3112.0 0.0800 -0.0021 79 h
0.488581 0.456967 6224 6224.0 -0.2200 -0.0137 79 h
0.474247 0.445538 9336 9336.0 0.0200 -0.0299 79 h
0.496580 0.563450 12448 12448.0 0.0700 -0.1727 79 h
0.533413 0.680958 15560 15560.0 -0.1700 0.0322 79 h
0.524531 0.480201 18672 18672.0 -0.9800 -0.0573 79 h
finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713
Does that mean it's normal for the average loss to go up during one pass, but that as long as multiple passes give the same loss then it's fine?

The model file stores only non-zero weights, so most likely the other weights were zeroed out, especially since you are using --l1.
It may be caused by many reasons. Perhaps your dataset isn't shuffled well enough. If you sort your dataset so that examples labeled -1 are in the first half and examples labeled 1 are in the second, then your model will show very good convergence on the first half, but you'll see the average loss bump up as it reaches the second half. So it may be an imbalance in the dataset. As for the last two losses - these are holdout losses (marked with 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
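If you need a quick way to shuffle the file, a minimal Python sketch (using the file names from your question) would be:
import random

# Shuffle a VW-format training file (one example per line); file names are
# the ones from the question.
with open("sh600000.feat") as f:
    lines = f.readlines()
random.shuffle(lines)
with open("sh600000.feat.shuf", "w") as f:
    f.writelines(lines)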
Well, in the master branch the use of -b 32 is currently blocked; you can use up to -b 31. In practice, -b 24 to -b 28 is usually enough even for tens of thousands of features.
I would recommend getting an up-to-date VW version from GitHub.

Related

Quickly compute `dot(a(n:end), b(1:end-n))`

Suppose we have two one-dimensional arrays of values, a and b, which both have length N. I want to create a new array c such that c(n) = dot(a(n:N), b(1:N-n+1)). I can of course do this using a simple loop:
for n = 1:N
    c(n) = dot(a(n:N), b(1:N-n+1));
end
but given that this is such a simple operation, which resembles a convolution, I was wondering whether there isn't a more efficient way to do this (using MATLAB).
A solution using 1D convolution conv:
out = conv(a, flip(b));
c = out(ceil(numel(out)/2):end);
In conv, the first vector is multiplied by a reversed, sliding version of the second vector, so we compute the convolution of a and the flipped b and trim the unnecessary part.
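If you want to sanity-check the same identity outside MATLAB, here is a rough NumPy sketch (the example vectors are arbitrary):
import numpy as np

a = np.array([9, 10, 2, 10, 7], dtype=float)
b = np.array([1, 3, 6, 10, 10], dtype=float)
N = len(a)

# Direct computation: c[n] = dot(a[n:], b[:N-n])  (0-based indexing)
c_direct = np.array([np.dot(a[n:], b[:N - n]) for n in range(N)])

# Convolution of a with reversed b; the last N entries are the answer.
c_conv = np.convolve(a, b[::-1])[N - 1:]

print(c_direct)  # [221. 146.  74.  31.   7.]
print(c_conv)    # same values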
This is an interesting problem!
I am going to assume that a and b are column vectors of the same length. Let us consider a simple example:
a = [9;10;2;10;7];
b = [1;3;6;10;10];
% yields:
c = [221;146;74;31;7];
Now let's see what happens when we compute the convolution of these vectors:
>> conv(a,b)
ans =
9
37
86
166
239
201
162
170
70
>> conv2(a, b.')
ans =
9 27 54 90 90
10 30 60 100 100
2 6 12 20 20
10 30 60 100 100
7 21 42 70 70
We notice that c is the sum of elements along the lower diagonals of the result of conv2. To show it more clearly, we'll transpose to get the diagonals in the same order as the values in c:
>> triu(conv2(a.', b))
ans =
9 10 2 10 7
0 30 6 30 21
0 0 12 60 42
0 0 0 100 70
0 0 0 0 70
So now it becomes a question of summing the diagonals of a matrix, which is a more common problem with existing solutions, for example this one by Andrei Bobrov:
C = conv2(a.', b);
p = sum( spdiags(C, 0:size(C,2)-1) ).'; % This gives the same result as the loop.
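For reference, a rough NumPy sketch of the same diagonal-sum idea (illustrative only; it builds the full outer product, so it is not meant to be fast):
import numpy as np

a = np.array([9, 10, 2, 10, 7], dtype=float)
b = np.array([1, 3, 6, 10, 10], dtype=float)
N = len(a)

# conv2 of a row and a column is just the outer product b * a.',
# so summing its main and super-diagonals reproduces c.
C = np.outer(b, a)
c = np.array([np.trace(C, offset=k) for k in range(N)])
print(c)  # [221. 146.  74.  31.   7.]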

Why does the execution time drop more sharply than expected as the number of processors increases?

I am executing my programme 5000000 times in parallel using "Parallel.For" from F#.
Average execution time per task is given below.
Number of active cores : Execution Time (microseconds)
2 : 866
4 : 424
8 : 210
12 : 140
16 : 106
24 : 76
32 : 60
Given the fact that
by doubling the number of cores, the maximum speedup we can get should be less than 2 (ideally it can be exactly 2),
what can be the reason for this sharp speedup?
Multiplying the number of cores by the average per-task time gives a rough measure of the total core time spent per task:
2 * 866 = 1732
4 * 424 = 1696
8 * 210 = 1680
12 * 140 = 1680
16 * 106 = 1696
24 * 76 = 1824
32 * 60 = 1920
So as you increase parallelism, relative performance improves and then begins to fall. The improvement is possibly due to amortization of overhead costs like JIT compilation or in the algorithm that manages the parallelization.
The degradation as the degree of parallelism increases is often due to some sort of resource contention, excessive context switching, or the like.
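A tiny Python sketch that just reproduces the arithmetic above, using the figures quoted in the question:
# (cores, average per-task time in microseconds), as reported in the question
measurements = [(2, 866), (4, 424), (8, 210), (12, 140), (16, 106), (24, 76), (32, 60)]
for cores, per_task_us in measurements:
    print(cores, cores * per_task_us)  # core count times per-task time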

Slideshow Algorithm

I need to design an algorithm for a photo slideshow that is constantly receiving new images, so that the oldest pictures appear less often in the presentation, until a balance is reached between the old photos and the newly added ones.
I have thought that every image could have a counter of the number of times it has been shown, and that I could prioritize the pictures with the lowest value in that variable.
Any other ideas or solutions would be well received.
You can achieve an overall near-uniform distribution (each image appears about the same number of times in the long run), but I wouldn't recommend doing it. Images that were available early would appear very rarely later on. A better user experience would be to simply choose a random image from all the available images at each step.
If you still want near-uniform distribution for the long run, you should set the probability for any image based on the number of times it appeared so far. For example:
p(i) = 1 - count(i) / (max_count() + epsilon)
Here is a simple R script that simulates such a process. 37 random images are selected before a new image becomes available. This process is repeated 3000 times:
h <- 3000 # total images
eps <- 0.001
t <- integer(length=h) # t[i]: no. of instances of value i in r
r <- c() # processed vector of image indexes
m <- 0 # highest number of appearances for an image
for (i in 1:h)
for (j in 1:37) # select 37 random images in range 1..i
{
v <- sample(1:i, 1, prob=1-t[1:i]/(m+eps)) # pick one image; image k is weighted by 1-t[k]/(m+eps)
r <- c(r, v) # add to output vector
t[v] <- t[v]+1 # update appearances count
m <- max(m, t[v]) # update highest number of appearances
}
plot(table(r))
The output plots (for epsilon = 0.001 and epsilon = 0.0001) show the number of times each image appeared.
If we look, for example, at the indexes in the output vector at which, say, image #3 was selected:
> which(r==3)
[1] 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
[21] 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 1189 34767 39377
[41] 70259
Note that if epsilon is very small, the sequence will seem less random (newer images are much preferred). In the long run, however, any epsilon will do.
Instead of a view counter, you could also try basing your algorithm on the timestamp at which each image was uploaded.
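For instance, a minimal Python sketch of one possible timestamp-based weighting (the specific weight function is only an illustrative assumption):
import random
import time

def pick_image(images):
    # images: list of (name, upload_timestamp) pairs.
    # The weight below (1 / age in hours, floored at one hour) is only an
    # illustrative assumption; any monotone function of age would do.
    now = time.time()
    weights = [1.0 / max(1.0, (now - ts) / 3600.0) for _, ts in images]
    name, _ = random.choices(images, weights=weights, k=1)[0]
    return name

images = [("old.jpg", time.time() - 86400), ("new.jpg", time.time() - 60)]
print(pick_image(images))  # prints "new.jpg" most of the time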

Including STDEV in this macro and call

I've put together the following macro to get random samples of various sizes and then calculate the MEAN a given number of times. I would also like to calculate the STDEV alongside the MEAN. I've tried various amendments to the script, but I'm struggling to apply the correct syntax, I think. Thanks for any help.
Regards
Andy
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
myvar = the variable of interest (here we want the mean of salary)
nbsampl = number of samples.
size = the size of each samples.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='E:\Monte carlo testing\s1.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='E:\Monte carlo testing\sample.sav'.
!IFEND
SAVE OUTFILE='E:\Monte carlo testing\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=salary BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
!sample myvar=VAR00001 nbsampl=200 size= 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
Does this help?
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar)
/sd = STDEV(!myvar)
/ss=FIRST(ss).

How to optimize the layout of rectangles

I have a dynamic number of equally proportioned and sized rectangular objects that I want to optimally display on the screen. I can resize the objects but need to maintain proportion.
I know what the screen dimensions are.
How can I calculate the optimal number of rows and columns that I will need to divide the screen into, and what size will I need to scale the objects to?
Thanks,
Jamie.
Assuming that all rectangles have the same dimensions and orientation, and that these should not be changed.
Let's play!
// Proportion of the screen
// w,h width and height of your rectangles
// W,H width and height of the screen
// N number of your rectangles that you would like to fit in
// ratio
r = (w*H) / (h*W)
// This ratio is important since we can define the following relationship
// nbRows and nbColumns are what you are looking for
// nbColumns = nbRows * r (there will be integer-rounding issues)
// we are looking for the minimum values of nbRows and nbColumns such that
// N <= nbRows * nbColumns = (nbRows ^ 2) * r
nbRows = ceil ( sqrt ( N / r ) ) // r is positive...
nbColumns = ceil ( N / nbRows )
I hope I got my maths right, but that cannot be far from what you are looking for ;)
EDIT:
there is not much difference between having a ratio and having the width and height...
// If ratio = w/h
r = ratio * (H/W)
// If ratio = h/w
r = H / (W * ratio)
And then you're back to using 'r' to find out how many rows and columns to use.
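A quick Python sketch of those formulas, where w, h are the rectangle proportions, W, H the screen size and N the number of rectangles (the sample values are just an illustration):
import math

def grid_for(N, w, h, W, H):
    # r, nbRows and nbColumns follow the formulas above
    r = (w * H) / (h * W)
    rows = math.ceil(math.sqrt(N / r))
    cols = math.ceil(N / rows)
    return rows, cols

# Sample values: 14 rectangles with 5:3 proportions on a 640 x 480 screen
print(grid_for(14, 5, 3, 640, 480))  # (4, 4)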
Jamie, I interpreted "optimal number of rows and columns" to mean "how many rows and columns will provide the largest rectangles, consistent with the required proportions and screen size". Here's a simple approach for that interpretation.
Each possible choice (number of rows and columns of rectangles) results in a maximum possible size of rectangle for the specified proportions. Looping over the possible choices and computing the resulting size implements a simple linear search over the space of possible solutions. Here's a bit of code that does that, using an example screen of 480 x 640 and rectangles in a 3 x 5 proportion.
def min(a, b)
  a < b ? a : b
end

screenh, screenw = 480, 640
recth, rectw = 3.0, 5.0
ratio = recth / rectw
puts ratio
nrect = 14
(1..nrect).each do |nhigh|
  nwide = ((nrect + nhigh - 1) / nhigh).truncate
  maxh, maxw = (screenh / nhigh).truncate, (screenw / nwide).truncate
  relh, relw = (maxw * ratio).truncate, (maxh / ratio).truncate
  acth, actw = min(maxh, relh), min(maxw, relw)
  area = acth * actw
  puts([nhigh, nwide, maxh, maxw, relh, relw, acth, actw, area].join("\t"))
end
Running that code provides the following trace:
1 14 480 45 27 800 27 45 1215
2 7 240 91 54 400 54 91 4914
3 5 160 128 76 266 76 128 9728
4 4 120 160 96 200 96 160 15360
5 3 96 213 127 160 96 160 15360
6 3 80 213 127 133 80 133 10640
7 2 68 320 192 113 68 113 7684
8 2 60 320 192 100 60 100 6000
9 2 53 320 192 88 53 88 4664
10 2 48 320 192 80 48 80 3840
11 2 43 320 192 71 43 71 3053
12 2 40 320 192 66 40 66 2640
13 2 36 320 192 60 36 60 2160
14 1 34 640 384 56 34 56 1904
From this, it's clear that either a 4x4 or 5x3 layout will produce the largest rectangles. It's also clear that the rectangle size (as a function of row count) is worst (smallest) at the extremes and best (largest) at an intermediate point. Assuming that the number of rectangles is modest, you could simply code the calculation above in your language of choice, but bail out as soon as the resulting area starts to decrease after rising to a maximum.
That's a quick and dirty (but, I hope, fairly obvious) solution. If the number of rectangles became large enough to bother, you could tweak for performance in a variety of ways:
use a more sophisticated search algorithm (partition the space and recursively search the best segment),
if the number of rectangles is growing during the program, keep the previous result and only search nearby solutions,
apply a bit of calculus to get a faster, precise, but less obvious formula.
This is almost exactly like kenneth's question here on SO. He also wrote it up on his blog.
If you scale the proportions in one dimension so that you are packing squares, it becomes the same problem.
One way I like to do that is to use the square root of the area:
Let
r = number of rectangles
w = width of display
h = height of display
Then,
A = (w * h) / r is the area per rectangle
and
L = sqrt(A) is the base length of each rectangle.
If they are not square, then just multiply accordingly to keep the same ratio.
Another way to do a similar thing is to just take the square root of the number of rectangles. That'll give you one dimension of your grid (i.e. the number of columns):
C = sqrt(n) is the number of columns in your grid
and
R = n / C is the number of rows.
Note that you will have to take the ceiling of one of these and the floor of the other; otherwise you will truncate and might miss a row.
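A small Python sketch of the two square-root heuristics described above (rough estimates, as noted):
import math

def cell_side(num_rects, width, height):
    # L = sqrt(A), where A = (w * h) / r as above
    area_per_rect = (width * height) / num_rects
    return math.sqrt(area_per_rect)

def grid_from_count(num_rects):
    # C = sqrt(n) columns, R = n / C rows, with ceiling/floor handled so no row is lost
    cols = math.ceil(math.sqrt(num_rects))
    rows = math.floor(num_rects / cols)
    if rows * cols < num_rects:
        rows += 1
    return rows, cols

print(cell_side(14, 640, 480))  # about 148.1
print(grid_from_count(14))      # (4, 4)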
Your mention of rows and columns suggests that you envisaged arranging the rectangles in a grid, possibly with a few spaces (e.g. some of the bottom row) unfilled. Assuming this is the case:
Suppose you scale the objects such that (an as-yet unknown number) n of them fit across the screen. Then
objectScale=screenWidth/(n*objectWidth)
Now suppose there are N objects, so there will be
nRows = ceil(N/n)
rows of objects (where ceil is the Ceiling function), which will take up
nRows*objectScale*objectHeight
of vertical height. We need to find n, and want to choose the smallest n such that this distance is smaller than screenHeight.
A simple mathematical expression for n is made trickier by the presence of the ceiling function. If the number of columns is going to be fairly small, probably the easiest way to find n is just to loop through increasing n until the inequality is satisfied.
Edit: We can bound the search from below: n is at least
floor(sqrt(N*objectHeight*screenWidth/(screenHeight*objectWidth)))
so we can start the loop there rather than at 1 and work up; the solution is then found quickly. An O(1) solution is to assume that
nRows = N/n + 1
or to take
n=ceil(sqrt(N*objectHeight*screenWidth/(screenHeight*objectWidth)))
(the solution of Matthieu M.) but these have the disadvantage that the value of n may not be optimal.
Border cases occur when N=0, and when N=1 and the aspect ratio of the objects is such that objectHeight/objectWidth > screenHeight/screenWidth - both of these are easy to deal with.
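A minimal Python sketch of the search described here, starting at the bound and working up (variable names follow the text; treat it as a sketch, not a definitive implementation):
import math

def smallest_columns(N, object_w, object_h, screen_w, screen_h):
    # Smallest n such that N objects, scaled so n fit across the width,
    # still fit vertically on the screen.
    if N == 0:
        return 0
    # Lower bound on n from the continuous relaxation discussed in the edit.
    n = max(1, math.floor(math.sqrt(N * object_h * screen_w / (screen_h * object_w))))
    while True:
        object_scale = screen_w / (n * object_w)
        n_rows = math.ceil(N / n)
        if n_rows * object_scale * object_h <= screen_h:
            return n
        n += 1

# Example: 14 objects with 5:3 proportions on a 640 x 480 screen -> 4 columns
print(smallest_columns(14, 5, 3, 640, 480))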
