Including STDEV in this macro and call - syntax

I've put together the following macro to draw a random sample of various sizes and then calculate the MEAN a given number of times. I would also like to calculate the STDEV alongside the MEAN. I've tried various amendments to the script, but I think I'm struggling to get the syntax right. Thanks for any help.
Regards
Andy
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
myvar = the variable of interest (here we want the mean of salary)
nbsampl = number of samples.
size = the size of each sample.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='E:\Monte carlo testing\s1.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='E:\Monte carlo testing\sample.sav'.
!IFEND
SAVE OUTFILE='E:\Monte carlo testing\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=salary BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
!sample myvar=VAR00001 nbsampl=200 size= 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100

Does this help?
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar)
/sd = STDEV(!myvar)
/ss=FIRST(ss).
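One caveat: the call above includes size=1, and the standard deviation of a single case is undefined, so the sd column will be system-missing for those samples. To sanity-check the logic outside SPSS, here is a minimal Python analogue of what the macro computes (with hypothetical data standing in for the salary variable in s1.sav):

import random
import statistics

# Analogue of the macro: draw nbsampl random samples of each size and
# record the mean and standard deviation of salary per sample.
# `population` is hypothetical data standing in for s1.sav.
population = [random.gauss(50000, 10000) for _ in range(10000)]

results = []
for size in (10, 50, 100):            # a subset of the sizes in the call
    for samplenb in range(200):       # nbsampl = 200, as in the call
        sample = random.sample(population, size)
        results.append((size, samplenb,
                        statistics.mean(sample),
                        statistics.stdev(sample)))

print(results[0])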

Related

Issue with Lua Random Number Generation in Loops

I have a script for a rock-paper-scissors (RPS) game I am making, and I am trying to generate a random number to determine a series of RPS moves. The logic is as follows:
moves = {}
table.insert(moves, 'rock')
table.insert(moves, 'paper')
table.insert(moves, 'scissors')

currentMoves = {}

-- game SDK library function that returns seconds since midnight
-- January 1 2000 UTC, used to initialize a new random sequence
math.randomseed(playdate.getSecondsSinceEpoch())
math.random(); math.random(); math.random();

-- generates a list of RPS moves to display on the screen
function generateMoves(maxMovesLength) -- I set maxMovesLength to 3
    currentMoves = {}
    for i = 1, maxMovesLength, 1 do
        randomNumber = math.random(1, 3)
        -- even with this, on the presumption that 1-33 is rock, 34-66 is
        -- paper, and 67-99 is scissors, I get a suspicious number of 3 of
        -- the same move
        otherRandomNumber = math.random(1, 99)
        print(otherRandomNumber)
        table.insert(currentMoves, moves[randomNumber])
    end
    return currentMoves
end
However, I noticed that using the Lua math.random() function, I seem to be getting a statistically unlikely number of series of 3 of the same RPS move. The likelihood of getting 3 of the same move (rock rock rock, paper paper paper, or scissors scissors scissors) should be about 11%, but I am getting sets of 3 much more often.
For example, here is what I got when I set maxMovesLength to 15:
36 -paper
41 -paper
60 -paper
22 -rock
1 -rock
2 -rock
91 -scissors
36 -paper
69 -scissors
76 -scissors
35 -paper
18 -rock
22 -rock
22 -rock
92 -scissors
From this sample, it seems that three-of-a-kind runs are happening much more often than they should. There are 13 overlapping series of 3 moves in this list of 15, and 3 of those 13 are three of a kind, a rate of about 23%, higher than the expected probability of 11%.
Is this just a flaw in the Lua math library?
It seems that when setting maxMovesLength to a very high number this issue doesn't exist, so I will just call math.random() a bunch of times before I actually use it in my game (more than the 3 times I currently do after randomseed()).
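A quick statistical sanity check (a simulation sketch in Python rather than Lua): the 11% figure applies to a single window of 3 independent moves, but 15 moves contain 13 overlapping windows, so the expected number of triples per sequence is 13/9, roughly 1.4, and seeing 3 of them is not especially rare even with a perfect generator.

import random

# Simulate sequences of 15 fair picks from 3 moves and count overlapping
# three-of-a-kind windows; the expectation is 13 * (1/9), about 1.44.
trials = 100000
counts = []
for _ in range(trials):
    seq = [random.randrange(3) for _ in range(15)]
    triples = sum(seq[i] == seq[i + 1] == seq[i + 2] for i in range(13))
    counts.append(triples)

print("mean triples per sequence:", sum(counts) / trials)
print("share of sequences with 3+ triples:", sum(c >= 3 for c in counts) / trials)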

Generate a unique identifier such as a hash every N minutes, but it has to be the same within each N-minute timeframe, without storing data

I want to create a unique identifier such as a small hash every N minutes, but the result should be the same within each N-minute timeframe, without storing data.
Examples when N is 10 minutes:
0 > 10 = 25ba38ac9
10 > 20 = 900605583
20 > 30 = 6156625fb
30 > 40 = e130997e3
40 > 50 = 2225ca027
50 > 60 = 3b446db34
Between minute 1 and 10 I get "25ba38ac9", but with anything between 10 and 20 I get "900605583", etc.
I have no starting/example code because I have no idea which algorithm I could use to create the desired result.
I did not provide a specific tag or language in this question because I am interested in the logic and not the final code, but I appreciate documented code as an example.
Pick your favourite hash function h and your favourite string sugar. To get a hash at time t, append the Euclidean quotient of t divided by N to sugar, and apply h to the result.
Example in Python:
h = lambda x: hex(abs(hash(x)))
sugar = 'Samir'

def hash_for_N_minutes(t, N=10):
    return h(sugar + str(t // N))

for t in range(0, 30, 2):
    print(t, hash_for_N_minutes(t, 10))
Output:
0 0xeb3d3abb787c890
2 0xeb3d3abb787c890
4 0xeb3d3abb787c890
6 0xeb3d3abb787c890
8 0xeb3d3abb787c890
10 0x45e2d2a970323e9f
12 0x45e2d2a970323e9f
14 0x45e2d2a970323e9f
16 0x45e2d2a970323e9f
18 0x45e2d2a970323e9f
20 0x334dce1d931e5da8
22 0x334dce1d931e5da8
24 0x334dce1d931e5da8
26 0x334dce1d931e5da8
28 0x334dce1d931e5da8
Weakness of this hash method, and suggested improvement
Of course, nothing stops you from inputting a time in the future, so you can easily answer the question "What will the hash be in exactly one hour?".
If you want future hashes to be unpredictable, you can combine the value t // N with a real-world value that depends on the time, is not known in advance, but is kept on record.
There are two well-known time series that fit this criterion: values related to the weather, and values related to the stock market.
See also this 2008 xkcd comic: https://xkcd.com/426/
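One caveat with the sketch above: Python's built-in hash() is salted per process for strings, so the identifiers change from run to run. If the identifier must be reproducible across processes and machines, a fixed digest such as SHA-256 from hashlib can be used instead; a minimal sketch:

import hashlib

# Same idea, but with a deterministic digest so the identifier is stable
# across processes and machines; truncated to 9 hex chars like the examples.
def stable_hash_for_N_minutes(t, N=10, sugar='Samir'):
    digest = hashlib.sha256((sugar + str(t // N)).encode()).hexdigest()
    return digest[:9]

for t in range(0, 30, 10):
    print(t, stable_hash_for_N_minutes(t))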

Why average loss goes up when training using Vowpal Wabbit

I tried to use VW to train a regression model on a small set of examples (about 3112). I think I'm doing it correctly, yet it shows me weird results. I dug around but didn't find anything helpful.
$ cat sh600000.feat | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.064936 0.077711 16 16.0 -0.1800 0.0547 77
0.060507 0.056079 32 32.0 0.0000 0.3164 79
0.136933 0.213358 64 64.0 -0.5900 -0.0850 79
0.151692 0.166452 128 128.0 0.0700 0.0060 79
0.133965 0.116238 256 256.0 0.0900 -0.0446 78
0.179995 0.226024 512 512.0 0.3700 -0.0217 79
0.109296 0.038597 1024 1024.0 0.1200 -0.0728 79
0.579360 1.049425 2048 2048.0 -0.3700 -0.0084 79
0.485389 0.485389 4096 4096.0 1.9600 0.3934 79 h
0.517748 0.550036 8192 8192.0 0.0700 0.0334 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
$ wc model
41 48 657 model
Questions:
Why is the number of features in the output (readable) model less than the number of actual features? I counted that the training data contain 78 features (79 with the bias, as shown during training). The number of feature bits is 24, which should be far more than enough to avoid collisions.
Why does the average loss actually go up during training, as you can see in the example above?
(Minor) I tried to increase the number of feature bits to 32, and it output an empty model. Why?
EDIT:
I tried to shuffle the input file, as well as using --holdout_off, as suggested. But the result is still almost the same: the average loss goes up.
$ cat sh600000.feat.shuf | vw --l1 1e-8 --l2 1e-8 --readable_model model -b 24 --passes 10 --cache_file cache --holdout_off
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.040000 0.040000 1 1.0 -0.2000 0.0000 79
0.051155 0.062310 2 2.0 0.2000 -0.0496 79
0.046606 0.042056 4 4.0 0.4100 0.1482 79
0.052160 0.057715 8 8.0 0.0200 0.0021 78
0.071332 0.090504 16 16.0 0.0300 0.1203 79
0.043720 0.016108 32 32.0 -0.2200 -0.1971 78
0.142895 0.242071 64 64.0 0.0100 -0.1531 79
0.158564 0.174232 128 128.0 0.0500 -0.0439 79
0.150691 0.142818 256 256.0 0.3200 0.1466 79
0.197050 0.243408 512 512.0 0.2300 -0.0459 79
0.117398 0.037747 1024 1024.0 0.0400 0.0284 79
0.636949 1.156501 2048 2048.0 1.2500 -0.0152 79
0.363364 0.089779 4096 4096.0 0.1800 0.0071 79
0.477569 0.591774 8192 8192.0 -0.4800 0.0065 79
0.411068 0.344567 16384 16384.0 0.0700 0.0450 77
finished run
number of examples per pass = 3112
passes used = 10
weighted example sum = 31120
weighted label sum = -105.5
average loss = 0.423404
best constant = -0.0033901
total feature number = 2451800
The training examples are unique to each other, so I doubt there is an over-fitting problem (which, as I understand it, usually happens when the number of inputs is too small compared with the number of features).
EDIT2:
I tried to print the average loss for every pass over the examples, and I see that it mostly remains constant.
$ cat dist/sh600000.feat | vw --l1 1e-8 --l2 1e-8 -f dist/model -P 3112 --passes 10 -b 24 --cache_file dist/cache
using l1 regularization = 1e-08
using l2 regularization = 1e-08
Num weight bits = 24
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.498822 0.498822 3112 3112.0 0.0800 0.0015 79 h
0.476677 0.454595 6224 6224.0 -0.2200 -0.0085 79 h
0.466413 0.445856 9336 9336.0 0.0200 -0.0022 79 h
0.490221 0.561506 12448 12448.0 0.0700 -0.1113 79 h
finished run
number of examples per pass = 2847
passes used = 5
weighted example sum = 14236
weighted label sum = -155.98
average loss = 0.490685 h
best constant = -0.0109567
total feature number = 1121506
Here is another try without the --l1, --l2, and -b parameters:
$ cat dist/sh600000.feat | vw -f dist/model -P 3112 --passes 10 --cache_file dist/cache
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
final_regressor = dist/model
using cache_file = dist/cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.520286 0.520286 3112 3112.0 0.0800 -0.0021 79 h
0.488581 0.456967 6224 6224.0 -0.2200 -0.0137 79 h
0.474247 0.445538 9336 9336.0 0.0200 -0.0299 79 h
0.496580 0.563450 12448 12448.0 0.0700 -0.1727 79 h
0.533413 0.680958 15560 15560.0 -0.1700 0.0322 79 h
0.524531 0.480201 18672 18672.0 -0.9800 -0.0573 79 h
finished run
number of examples per pass = 2801
passes used = 7
weighted example sum = 19608
weighted label sum = -212.58
average loss = 0.491739 h
best constant = -0.0108415
total feature number = 1544713
Does that mean it's normal for the average loss to go up within one pass, but that as long as multiple passes end with the same loss everything is fine?
The model file stores only non-zero weights, so most likely the other weights were zeroed out, especially since you are using --l1.
It may be caused by many reasons. Perhaps your dataset isn't shuffled well enough. If you sort your dataset so that examples labeled -1 are in the first half and examples labeled 1 are in the second, the model will show very good convergence on the first half, but you'll see the average loss bump up as it reaches the second half. So it may be an imbalance in the dataset. As for the last two losses: these are holdout losses (marked with 'h' at the end of the line) and may indicate that the model is overfitted. Please refer to my other answer.
Well, in the master branch the use of -b 32 is currently blocked. You can use up to -b 31. In practice, -b 24 to -b 28 is usually enough even for tens of thousands of features.
I would recommend getting an up-to-date VW version from GitHub.
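On the shuffling point: the question already references a sh600000.feat.shuf file; in case it is useful, here is a minimal Python sketch for producing it (an in-memory shuffle, which is fine for roughly 3112 examples):

import random

# Shuffle the VW training file so that similarly-labeled examples are not
# grouped together (file names taken from the question above).
with open('sh600000.feat') as f:
    lines = f.readlines()

random.shuffle(lines)

with open('sh600000.feat.shuf', 'w') as f:
    f.writelines(lines)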

How to calculate a submatrix of a matrix

I was taking a test for a company called Code Nation and came across a question that asked me to calculate how many times a number k appears in the submatrices of M[n][n]. There was an example with input like this:
5
1 2 3 2 5
36
M[i][j] is calculated as a[i]*a[j], which gives:
1,2,3,2,5
2,4,6,4,10
3,6,9,6,15
2,4,6,4,10
5,10,15,10,25
Now I had to calculate how many times 36 appears in a submatrix of M.
The answer was 5.
I am unable to comprehend how to calculate this submatrix. How should I represent it?
I had a naïve approach which resulted in many matrices, of which I think none are correct.
One of them is Submatrix[i][j]:
1 2 3 2 5
3 9 18 24 39
6 18 36 60 99
15 33 69 129 228
33 66 129 258 486
This was formed by adding up all the numbers from 0,0 to i,j.
In this, 36 did not appear 5 times, so I know it is incorrect. If you can back it up with some pseudocode, it will be icing on the cake.
I appreciate the help.
[Edit]: Referred to the following: link 1, link 2
My guess is that you have to compute how many submatrices of M have sum equal to 36.
Here is Matlab code:
a = [1,2,3,2,5];
n = length(a);
M = a'*a;
count = 0;
for a0 = 1:n
    for b0 = 1:n
        for a1 = a0:n
            for b1 = b0:n
                A = M(a0:a1, b0:b1);
                if (sum(A(:)) == 36)
                    count = count + 1;
                end
            end
        end
    end
end
count
This prints out 5.
So you are correctly computing M, but then you have to consider every submatrix of M. For example, M is
1,2,3,2,5
2,4,6,4,10
3,6,9,6,15
2,4,6,4,10
5,10,15,10,25
so one possible submatrix is
1,2,3
2,4,6
3,6,9
and if you add up all of these, then the sum is equal to 36.
There is an answer on cstheory which gives an O(n^3) algorithm for this.
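A side note on doing this faster (my own sketch, not the cstheory answer): since M[i][j] = a[i]*a[j], the sum of any submatrix factors as (sum of a row segment of a) times (sum of a column segment of a), so it suffices to count ordered pairs of contiguous-segment sums of a whose product is k. In Python:

from collections import Counter

# M[i][j] = a[i]*a[j], so the submatrix over rows a0..a1 and columns b0..b1
# sums to (a[a0]+...+a[a1]) * (a[b0]+...+a[b1]). Count ordered pairs of
# contiguous-segment sums whose product equals k.
a = [1, 2, 3, 2, 5]
k = 36

segment_sums = Counter()
for i in range(len(a)):
    s = 0
    for j in range(i, len(a)):
        s += a[j]
        segment_sums[s] += 1

count = sum(c * segment_sums[k // s]
            for s, c in segment_sums.items()
            if s != 0 and k % s == 0)
print(count)  # prints 5 for this example

Building the segment sums is O(n^2), so for positive integer entries this is roughly O(n^2) overall, against O(n^6) for the quadruple loop above.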

How to calculate one certain value from a rolling-window estimation in Stata

I'm using Stata to estimate Value-at-Risk (VaR) with the historical simulation method. Basically, I will create a rolling window of 100 observations to estimate VaR for the next 250 days (repeated 250 times). Hence, as far as I know, Stata's rolling-window time-series commands would be useful in this case. Here is the process:
Input: 350 values
1. Sort the first 100 values in ascending order (by magnitude).
2. Then take the 5th smallest value in each window.
3. Repeat 250 times.
Output: a list of the 5th values (250 in total).
Sounds simple, but I cannot get it right. This was my attempt:
program his, rclass
    sort lnreturn
    return scalar actual=lnreturn in 5
end
tsset stt
time variable: stt, 1 to 350
delta: 1 unit
rolling actual=r(actual), window(100) saving(C:\result100.dta, replace) : his
(running his on estimation sample)
And the result is:
Start end actual
1 100 -.047856
2 101 -.047856
3 102 -.047856
4 103 -.047856
.... ..... ......
251 350 -.047856
What I want is 250 different 5th values in the "actual" column, not the same value repeated like that.
If I understand this correctly, you want the 5th percentile of values in a window of 100. That should yield to summarize, detail or centile. I see no need to write a program.
Your bug is that your program his calculates the same thing each time it is called. There is no communication about windows other than what is explicit in your code. It is like saying
move here: now add 2 + 2
move there: now add 2 + 2
move to New York: now add 2 + 2
The result is invariant to your supposed position.
Note that I doubt that
return scalar actual=lnreturn in 5
really is your code. lnreturn[5] should work.
UPDATE: You don't even need rolling here. Looping over the data is easy enough. The data in this example are clearly fake.
clear
* sandpit
set obs 500
set seed 2803
gen y = ceil(exp(rnormal(3,2)))
l y in 1/5

* initialise
gen p5 = .

* windows of length 100: 1..100, 2..101, ..., 401..500
quietly forval j = 1/401 {
    local J = `j' + 99
    su y in `j'/`J', detail
    replace p5 = r(p5) in `j'
}

* check first calculation
su y in 1/100, detail
l in 1/5
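For comparison, the original question asked for the 5th smallest value per window (an order statistic) rather than the 5th percentile; a rolling version of that is a one-liner in Python with pandas (a sketch with fake data, mirroring the question's 350-observation setup):

import numpy as np
import pandas as pd

# Fake log returns standing in for lnreturn (350 observations, as in the
# question); for each window of 100 consecutive values, take the 5th smallest.
rng = np.random.default_rng(2803)
lnreturn = pd.Series(rng.normal(0, 0.02, 350))

fifth_smallest = lnreturn.rolling(100).apply(lambda w: np.sort(w)[4], raw=True)
print(fifth_smallest.dropna().head())  # one value per window, 251 in total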
