How to get running maximum in Stata? - algorithm

I would like to get the running maximum by writing Stata code.
I think I am quite close:
gen ctrhigh`iv' = max(ctr, L1.ctr, L2.ctr, L3.ctr, ..., L`iv'.ctr)
As you can see, my data are time series and `iv' represents the window (e.g. 5, 10 or 200 days)
The only problem is that you cannot pass a varlist or string containing numbers to max. E.g. the following is not possible:
local ivs 5 10 50 100 200
foreach iv in `ivs' {
local vals
local i = 1
while (`i' <= `iv') {
vals "`vals' `i'"
local ++i
}
gen ctrhigh`iv' = max(varlist vals) //not possible
}
How would I achieve this instead?
Example of quickly computing a running standard deviation
* standard deviation of ctr, see http://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods *
gen ctr_sq = ctr^2
by tid: gen ctr_cum = sum(ctr) if !missing(ctr)
by tid: gen ctr_sq_cum = sum(ctr_sq) if !missing(ctr_sq)
foreach iv in $ivs {
if `iv' == 1 continue
by tid: gen ctr_sum = ctr_cum - L`iv'.ctr_cum if !missing(ctr_cum) & !missing(L`iv'.ctr_cum)
by tid: gen ctr_sq_sum = ctr_sq_cum - L`iv'.ctr_sq_cum if !missing(ctr_sq_cum) & !missing(L`iv'.ctr_sq_cum)
by tid: gen ctrsd`iv' = sqrt((`iv' * ctr_sq_sum - ctr_sum^2) / (`iv'*(`iv'-1))) if !missing(ctr_sq_sum) & !missing(ctr_sum)
label variable ctrsd`iv' "Rolling std dev of close ticker rank by `iv' days."
drop ctr_sum ctr_sq_sum
}
drop ctr_sq ctr_cum ctr_sq_cum
Note: this is not an exact sd, it's an approximation. I realize that this is very different from a maximum, but this may serve as an illustration on how to deal with large data computations.

Your example is time series data and implies that you have tsset the data. You don't say whether you also have panel or longitudinal structure. I will assume the worst and assume the latter as it doesn't make the code much worse. So, suppose tsset id date. In fact, that's irrelevant to the code here except to make explicit my assumption that id is an identifier and date a time variable.
An unattractive way to do this is to loop over observations. Suppose window is set to 42.
local window = 42
gen max = .
tsset id date
quietly forval i = 1/`=_N' {
su ctr if inrange(date, date[`i'] - `window', date[`i']) & id == id[`i'], meanonly
replace max = r(max) in `i'
}
So, in words as well: summarize values of ctr if date within window and it's in the same panel (same id), and put the maximum in the current observation.
The meanonly option is not well named. It calculates some other quantities besides the mean, and the maximum is one. But you do want the meanonly option to make summarize go as fast as possible.
See my 2007 paper on events in intervals, freely available at http://www.stata-journal.com/sjpdf.html?articlenum=pr0033
I say unattractive, but this approach does have the advantage that it is easy to work with once you understand it.
I am not setting up an expression with lots of arguments to max(). You said 200 as an example and nothing stated that you might not ask for more, so far as I can see there may be no upper limit on window length, but there will be a limit on how complicated that expression can be.
If I think of a better way to do it, I'll post it. Or someone else will....

It seems like I can pass a string of arguments to max, like so:
* OPTION 1: compute running max by days *
foreach iv in $ivs {
* does not make sense for less than two days *
if `iv' < 2 continue
di "computing running max for ctr interval `iv'"
* set high for this amount of days *
local vars "ctr"
forval i = 1 / `iv' {
local vars "`vars', L`i'.ctr"
}
by tid: gen ctrh`iv' = max(`vars')
}
* OPTION 2: compute running max by days, ensuring that entire range is nonmissing *
foreach iv in $ivs {
* does not make sense for less than two days *
if `iv' < 2 continue
di "computing running max for ctr interval `iv'"
* set high for this amount of days *
local vars "ctr"
local condition "!missing(ctr)"
forval i = 1 / `iv' {
local vars "`vars', L`i'.ctr"
local condition "`condition' & !missing(L`i'.ctr)"
}
by tid: gen ctrh`iv' = max(`vars') if `condition'
}
This computes very quickly and does exactly what I need.
However, if you need an arbitrarily large window I think you should resort to Nick's answer.

Related

How do I repeat a random number

I've tried searching for help but I haven't found a solution yet, I'm trying to repeat math.random.
current code:
local ok = ""
for i = 0,10 do
local ok = ok..math.random(0,10)
end
print(ok)
no clue why it doesn't work, please help
Long answer
Even if the preferable answer is already given, just copying it will probably not lead to the solution you may expect or less future mistakes. So I decided to explain why your code fails and to fix it and also help better understand how DarkWiiPlayer's answer works (except for string.rep and string.gsub).
Issues
There are at least three issues in your code:
the math.random(m, n) function includes lower and the upper values
local declarations hide a same-name objects in outer scopes
math.random gives the same number sequence unless you set its seed with math.randomseed
See Detailed explanation section below for more.
Another point seems at least worth mentioning or suspicious to me, as I assume you might be puzzled by the result (it seems to me to reflect exactly the perspective of the C programmer, from which I also got to know Lua): the Lua for loop specifies start and end value, so both of these values are included.
Attempt to repair
Here I show how a version of your code that yields the same results as the answer you accepted: a sequence of 10 percent-encoded decimal digits.
-- this will change the seed value (but mind that its resolution is seconds)
math.randomseed(os.time())
-- initiate the (only) local variable we are working on later
local ok = ""
-- encode 10 random decimals (Lua's for-loop is one-based and inclusive)
for i = 1, 10 do
ok = ok ..
-- add fixed part
'%3' ..
-- concatenation operator implicitly converts number to string
math.random(0, 9) -- a random number in range [0..9]
end
print(ok)
Detailed explanation
This explanation makes heavily use of the assert function instead of adding print calls or comment what the output should be. In my opinion assert is the superior choice for illustrating expected behavior: The function guides us from one true statement - assert(true) - to the next, at the first miss - assert(false) - the program is exited.
Random ranges
The math library in Lua provides actually three random functions depending on the count of arguments you pass to it. Without arguments, the result is in the interval [0,1):
assert(math.random() >= 0)
assert(math.random() < 1)
the one-argument version returns a value between 1 and the argument:
assert(math.random(1) == 1)
assert(math.random(10) >= 1)
assert(math.random(10) <= 10)
the two-argument version explicitly specifies min and max values:
assert(math.random(2,2) == 2)
assert(math.random(0, 9) >= 0)
assert(math.random(0, 9) <= 9)
Hidden outer variable
In this example, we have two variables x of different type, the outer x is not accessible from the inner scope.
local x = ''
assert(type(x) == 'string')
do
local x = 0
assert(type(x) == 'number')
-- inner x changes type
x = x .. x
assert(x == '00')
end
assert(type(x) == 'string')
Predictable randomness
The first call to math.random() in a Lua program will return always the same number because the pseudorandom number generator (PRNG) starts at seed 1. So if you call math.randomseed(1), you'll reset the PRNG to its initial state.
r0 = math.random()
math.randomseed(1)
r1 = math.random()
assert(r0 == r1)
After calling math.randomseed(os.time()) calls to math.random() will return different sequences presuming that subsequent program starts differ at least by one second. See question Current time in milliseconds and its answers for more information about the resolutions of several Lua functions.
string.rep(".", 10):gsub(".", function() return "%3" .. math.random(0, 9) end)
That should give you what you want

Algorithm or Test Method to generate test case for Keno game

Keno Game rules: Keno is a lottery-like game which generates random combination of number ranging from 1 to 80 with the size of 20. Player may choose a number game to play (1,2,3,4,5,6,7,8,9,10,15). The payout depends on the number game and number of matches.
I understand the difficulties of generating a complete test case to cover all possible combination not to mention the possibility of matching the random game result. Therefore, I initially applied the Random Combination testing method but later found out it is hard to achieve high coverage of all possible cases (roughly about 10%). By now, I have come across Pure Random Combinatorial, CATS, AETG, K-combination but none is ideal for Keno game.
For now, the inputs are num_game_size, numSelected[num_game_size]. Meanwhile, the outputs are result[20], matchedNum[], matched_num_size, payout. Of course, there are more inputs: continuous_game_toplay_size, bet_amount.
I'm looking forward for any suggestion on any testing method or algorithm that has high coverage on pure random and large combination test case if executed for a month or two. My objective is to test combination of selected numbers and their payout for each different number of matches when the result is pure random generated. For instance:
/* Assume the result is pure random generated */
/* Match 0 */
num_game_size = 2
numSelected[2] = {1,72}
result[20] = {2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21}
matchedNum[] = {}
matched_num_size = 0
payout = 0
/* Match 1 */
num_game_size = 2
numSelected[2] = {1,72}
result[20] = {1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21}
matchedNum[] = {1}
matched_num_size = 1
payout = 1
/* Match 2 */
num_game_size = 2
numSelected[2] = {1,72}
result[20] = {1,72,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21}
matchedNum[] = {1,72}
matched_num_size = 2
payout = 5
The total possibility will be C(80,2) * C(80,20) = 3160 * 3535316142212174320 = 1.117159900939047e+19. Meaning for each combination of number with the size of two within range of 1 to 80, there are C(80,20) possible results. It will probably takes few years to cover all possibility (including 1,3,4,5,6,7,8,9,10,15 number game) when the result is pure random generated (quantum RNG).
Ps: Most test method I found only consider either random or combination problem and require a tremendous amount of time to complete test case generation. I'm trying to create any program to help me in winning the Keno game IRL.

Random unique string against a blacklist

I want to create a random string of a fixed length (8 chars in my use case) and the generated string has to be case sensitive and unique against a blacklist. I know this sounds like a UUID but I have a specific requirement that prevents me from utilizing them
some characters are disallowed, i.e. I, l and 1 are lookalikes, and O and 0 as well
My initial implementation is solid and solves the task but performs poorly. And by poorly I mean it is doomed to be slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
$chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
$uuid = null;
while (true) {
$uuid = substr(str_shuffle($chars), 0, 8);
if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first())) {
break;
}
}
return $uuid;
}
Please spare me the critique, we live in an agile world and this implementation is functional and is quick to code.
With a small set of data it works beautifully. However if I have 10 million entries in the blacklist and try to create 1000 more it fails flat as it takes 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community can point me in the right direction.
Best,
Nikola
Two options:
Just use a hash of something unique, and truncate so it fits in the bandwidth of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(hash)
Generate five of the digits from an incrementing counter and three randomly. A thief will have a worse than 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
I ended up modifying the approach: instead of checking for uuid existence on every loop, e.g. 50K DB checks, I now split the generated codes into multiple chunks of 1000 codes and issue an INSERT IGNORE batch query within a transaction.
If the affected rows are as many as the items (1000 in this case) I know there wasn't a collision and I can commit the transaction. Otherwise I need to rollback the chunk and generate another 1000 codes.

Neural network - solve a net with time arrays and different sample rate

I have 3 measurements for a machine. Each measurement is trigged every time its value changes by a certain delta.
I have these 3 data sets, represented as Matlab objects: T1, T2 and O. Each of them has a obj.t containing the timestamp values and obj.y containing the measurement values.
I will measure T1 and T2 for a long time, but O only for a short period. The task is to reconstruct O_future from T1 and T2, using the existing values for O for training and validation.
Note that T1.t, T2.t and O.t are not equal, not even their frequency (I might call it 'variable sample rate', but not sure if this name applies).
Is it possible to solve this problem using Matlab or other software? Do I need to resample all data to a common time vector?
Concerning the common time. Below some basic code which does this. (I guess you might know how to do it but just in case). However, the second option might bring you further...
% creating test signals
t1 = 1:2:100;
t2 = 1:3:200;
to = [5 6 100 140];
s1 = round (unifrnd(0,1,size(t1)));
s2 = round (unifrnd(0,1,size(t2)));
o = ones(size(to));
maxt = max([t1 t2 to]);
mint = min([t1 t2 to]);
% determining minimum frequency
frequ = min([t1(2:length(t1)) - t1(1:length(t1)-1) t2(2:length(t2)) - t2(1:length(t2)-1) to(2:length(to)) - to(1:length(to)-1)] );
% create a time vector with highest resolution
tinterp = linspace(mint,maxt,(maxt-mint)/frequ+1);
s1_interp = zeros(size(tinterp));
s2_interp = zeros(size(tinterp));
o_interp = zeros(size(tinterp));
for i = 1: length(t1)
s1_interp(ceil(t1(i))==floor(tinterp)) =s1(i);
end
for i = 1: length(t2)
s2_interp(ceil(t2(i))==floor(tinterp)) =s2(i);
end
for i = 1: length(to)
o_interp(ceil(to(i))==floor(tinterp)) = o(i);
end
figure,
subplot 311
hold on, plot(t1,s1,'ro'), plot(tinterp,s1_interp,'k-')
legend('observation','interpolation')
title ('signal 1')
subplot 312
hold on, plot(t2,s2,'ro'), plot(tinterp,s2_interp,'k-')
legend('observation','interpolation')
title ('signal 2')
subplot 313
hold on, plot(to,o,'ro'), plot(tinterp,o_interp,'k-')
legend('observation','interpolation')
title ('O')
Its not ideal as for large vectors this might become ineffective as soon as you have small sampling frequencies in one of the signals which will determine the lowest resolution.
Another option would be to define a coarser time vector and look at the number of events that happend in a certain period which might have some predictive power as well (not sure about your setup).
The structure would be something like
coarse_t = 1:5:100;
s1_coarse = zeros(size(coarse_t));
s2_coarse = zeros(size(coarse_t));
o_coarse = zeros(size(coarse_t));
for i = 2:length(coarse_t)
s1_coarse(i) = sum(nonzeros(s1(t1<coarse_t(i) & t1>coarse_t(i-1))));
s2_coarse(i) = sum(nonzeros(s2(t2<coarse_t(i) & t2>coarse_t(i-1))));
o_coarse(i) = sum(nonzeros(o(to<coarse_t(i) & to>coarse_t(i-1))));
end

Build fixed interval dataset from random interval dataset using stale data

Update: I've provided a brief analysis of the three answers at the bottom of the question text and explained my choices.
My Question: What is the most efficient method of building a fixed interval dataset from a random interval dataset using stale data?
Some background: The above is a common problem in statistics. Frequently, one has a sequence of observations occurring at random times. Call it Input. But one wants a sequence of observations occurring say, every 5 minutes. Call it Output. One of the most common methods to build this dataset is using stale data, i.e. set each observation in Output equal to the most recently occurring observation in Input.
So, here is some code to build example datasets:
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];
OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];
Both datasets start at close to midnight at the turn of the millennium. However, the timestamps in Input occur at random intervals while the timestamps in Output occur at fixed intervals. For simplicity, I have ensured that the first observation in Input always occurs before the first observation in Output. Feel free to make this assumption in any answers.
Currently, I solve the problem like this:
sMax = size(Output, 1);
tMax = size(Input, 1);
s = 1;
t = 2;
%#Loop over input data
while t <= tMax
if Input(t, 1) > Output(s, 1)
%#If current obs in Input occurs after current obs in output then set current obs in output equal to previous obs in input
Output(s, 2:end) = Input(t-1, 2:end);
s = s + 1;
%#Check if we've filled out all observations in output
if s > sMax
break
end
%#This step is necessary in case we need to use the same input observation twice in a row
t = t - 1;
end
t = t + 1;
if t > tMax
%#If all remaining observations in output occur after last observation in input, then use last obs in input for all remaining obs in output
Output(s:end, 2:end) = Input(end, 2:end);
break
end
end
Surely there is a more efficient, or at least, more elegant way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has some in-built function I'm not aware of? Any help would be much appreciated as I use this routine a LOT for some large datasets.
THE ANSWERS: Hi all, I've analyzed the three answers, and as they stand, Angainor's is the best.
ChthonicDaemon's answer, while clearly the easiest to implement, is really slow. This is true even when the conversion to a timeseries object is done outside of the speed test. I'm guessing the resample function has a lot of overhead at the moment. I am running 2011b, so it is possible Mathworks have improved it in the intervening time. Also, this method needs an additional line for the case where Output ends more than one observation after Input.
Rody's answer runs only slightly slower than Angainor's (unsurprising given they both employ the histc approach), however, it seems to have some problems. First, the method of assigning the last observation in Output is not robust to the last observation in Input occurring after the last observation in Output. This is an easy fix. But there is a second problem which I think stems from having InputTimeStamp as the first input to histc instead of the OutputTimeStamp adopted by Angainor. The problem emerges if you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)'; to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)'; when setting up the example inputs.
Angainor's appears robust to everything I threw at it, plus it was the fastest.
I did a lot of speed tests for different input specifications - the following numbers are fairly representative:
My naive loop: Elapsed time is 8.579535 seconds.
Angainor: Elapsed time is 0.661756 seconds.
Rody: Elapsed time is 0.913304 seconds.
ChthonicDaemon: Elapsed time is 22.916844 seconds.
I'm +1-ing Angainor's solution and marking the question solved.
This "stale data" approach is known as a zero order hold in signal and timeseries fields. Searching for this quickly brings up many solutions. If you have Matlab 2012b, this is all built in to the timeseries class by using the resample function, so you would simply do
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
InputTimeSeries = timeseries(InputData, InputTimeStamp);
OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001);
OutputTimeSeries = resample(InputTimeSeries, OutputTimeStamp, 'zoh'); % zoh stands for zero order hold
Here is my take on the problem. histc is the way to go:
% find Output timestamps in Input bins
N = histc(Output(:,1), Input(:,1));
% find counts in the non-empty bins
counts = N(find(N));
% find Input signal value associated with every bin
val = Input(find(N),2);
% now, replicate every entry entry in val
% as many times as specified in counts
index = zeros(1,sum(counts));
index(cumsum([1 counts(1:end-1)'])) = 1;
index = cumsum(index);
val_rep = val(index)
% finish the signal with last entry from Input, as needed
val_rep(end+1:size(Output,1)) = Input(end,2);
% done
Output(:,2) = val_rep;
I checked against your procedure for a few different input models (I changed the number of Output timestamps) and the results are the same. However, I am still not sure I understood your problem, so if something is wrong here let me know.

Resources