Reset random number stream - random

It appears that SAS/IML has the ability to reset its random number stream (doc link).
Is there a similar feature for random number routines in the SAS data step?
Based on this post, it seems that subsequent calls to streaminit are ignored within a single datastep.
For example, the below code produces different random numbers for each row:
data want;
do i = 1 to 2;
call streaminit(123); * <-- WANT THIS TO RESET THE STREAM;
ran1 = rand('uniform');
ran2 = rand('uniform');
ran3 = rand('uniform');
put _all_;
output;
end;
run;
Output:
i=1 ran1=0.5817000773 ran2=0.0356216603 ran3=0.0781806207
i=2 ran1=0.3878454913 ran2=0.3291709244 ran3=0.3615948586
I would like the output to be:
i=1 ran1=0.5817000773 ran2=0.0356216603 ran3=0.0781806207
i=2 ran1=0.5817000773 ran2=0.0356216603 ran3=0.0781806207

You cannot reset the streams for the RAND function in SAS 9.4M4. However, you can rewind a stream in SAS 9.4M5 (which shipped in Sep 2017) by using the new STREAMREWIND routine. The following program shows the syntax:
data want;
call streaminit(123);
do i = 1 to 2;
call streamrewind;
ran1 = rand('uniform');
ran2 = rand('uniform');
ran3 = rand('uniform');
put _all_;
output;
end;
run;

You could work around this using generated code, though, with CALL EXECUTE or perhaps DOSUBL, for example:
data _null_;
do i = 1 to 2;
rc=dosubl(cats("data want_",i,";
call streaminit(123); * <-- WANT THIS TO RESET THE STREAM;
ran1 = rand('uniform');
ran2 = rand('uniform');
ran3 = rand('uniform');
i=",i,";
put _all_;
output;
run;
"));
end;
rc = dosubl("data want; set want_1 want_2; run;");
run;
Obviously easier/better to write a macro to do this part.
This is a limitation unfortunately of the 'new' RAND routine; the old one was much easier to work with in this regard (as it actually truly had just one seed). The new one's seed properties are more complex, and so while you can initialize it with a single number, it's not as straightforward, hence the complications.
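The macro alternative mentioned above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the macro name repeat_stream and its parameters n and seed are hypothetical. It relies on the same idea as the DOSUBL version, namely that each iteration runs in its own DATA step, so each CALL STREAMINIT starts a fresh stream.

```sas
/* Hypothetical macro: one DATA step per iteration, so the stream restarts each time */
%macro repeat_stream(n=2, seed=123);
  %local k;
  %do k = 1 %to &n;
    data want_&k;
      call streaminit(&seed);   /* a new step, so this initialization takes effect */
      i = &k;
      ran1 = rand('uniform');
      ran2 = rand('uniform');
      ran3 = rand('uniform');
    run;
  %end;

  /* Stack the per-iteration datasets into one */
  data want;
    set %do k = 1 %to &n; want_&k %end; ;
  run;
%mend repeat_stream;

%repeat_stream(n=2, seed=123)
```

Every row of want should then carry the same three values, because each step draws from a stream initialized with the same seed.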

You can use call ranuni to use the same seed for two different random number streams.
Note that this uses a different, inferior PRNG, with a much shorter cycle and poorer statistical properties than the rand() function.
Example:
data x;
seed1 = 123;
seed2 = 123;
do i =1 to 3;
call ranuni(seed1, x);
call ranuni(seed2, y);
output;
end;
run;
Output:
i=1 x=0.7503960881 y=0.7503960881
i=2 x=0.3209120251 y=0.3209120251
i=3 x=0.178389649 y=0.178389649

Related

How to rank multiple variables in a large data set?

I have a data set of around 50 million records with around 30 variables (columns).
I need to rank the dataset for each variable.
PROC RANK does not work because it requires too much memory for a dataset this large.
To assign ranks manually, I have to sort the dataset on the respective variable and then compute the rank with a formula. The problem is that this means sorting the dataset 30 times, once per variable, which takes far too long to be feasible.
What alternatives can we use in this case?
You're in a tough spot without many options. If you're sorting and keeping all 30 variables each time, that will significantly increase your processing times. If I were you, I'd keep only the variable you want to rank plus a sequence number, apply your formula, and merge everything back together at the end. This requires looping over each variable in your dataset. See the example below and check whether it decreases your processing times:
** PUT ALL VARIABLES INTO LIST **;
PROC SQL NOPRINT;
SELECT DISTINCT(NAME)
INTO :VARS SEPARATED BY " "
FROM DICTIONARY.COLUMNS
WHERE LIBNAME = 'SASHELP' AND MEMNAME = 'FISH';
QUIT;
%PUT &VARS.;
** CREATE SEQUENCE NUMBER IN FULL DATA **;
DATA FISH; SET SASHELP.FISH;
SEQ=_N_;
RUN;
** LOOP OVER EACH VARIABLE TO ONLY PROCESS THAT VARIABLE AND SEQUENCE -- REDUCES PROCESSING TIME **;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET VAR = %SCAN(&VARS,&I.);
DATA FISH_&I.; SET FISH (KEEP=SEQ &VAR.);
RUN;
/* INSERT YOUR FORMULA CODE HERE ON FISH_&I. DATA (MINE IS AN EXAMPLE) */
PROC SORT DATA = FISH_&I.;
BY &VAR.;
RUN;
DATA FISH1_&I.; SET FISH_&I.;
BY &VAR.;
RANK_&VAR = _N_;
RUN;
/* RESORT FINAL DATA BY SEQUENCE NUMBER VARIABLE */
PROC SORT DATA = FISH1_&I.;
BY SEQ;
RUN;
%END;
%MEND;
%LOOP_OVER(&VARS.);
** MERGE ALL SUBSETS BACK TOGETHER BY THE ORIGINAL SEQUENCE NUMBER **;
DATA FINAL;
MERGE FISH1_:;
BY SEQ;
DROP SEQ;
RUN;
If you just need to rank into deciles / percentiles etc rather than a complete ranking from 1 to 50m across all 50m rows, you should be able to get a very good approximation of the correct answer using a much smaller amount of memory via proc summary, using qmethod=P2 and specifying a suitable qmarkers setting.
This approach uses the P-squared algorithm:
http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
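A minimal sketch of that approach follows, under stated assumptions: the dataset have, the variable x, the five cut points, and the QMARKERS value are all illustrative choices, not taken from the question. QMETHOD=P2 estimates the percentiles in a single pass with fixed memory, and the second step assigns each row to an approximate bin.

```sas
/* Illustrative input: WORK.HAVE with one numeric column X */
data have;
  call streaminit(1);
  do id = 1 to 100000;
    x = rand('normal');
    output;
  end;
run;

/* One pass, fixed memory: approximate percentiles via the P-squared method */
proc summary data=have qmethod=p2 qmarkers=211;
  var x;
  output out=cutoffs p10= p25= p50= p75= p90= / autoname;
run;

/* Assign each row to a bin (1 to 6) using the estimated cut points */
data binned;
  if _n_ = 1 then set cutoffs(keep=x_p:);   /* x_P10 ... x_P90 are retained */
  set have;
  bin = 1 + (x > x_p10) + (x > x_p25) + (x > x_p50)
          + (x > x_p75) + (x > x_p90);
  drop x_p:;
run;
```

Rows with a missing x land in bin 1 here, because every comparison is false. More markers trade memory for accuracy, so the value 211 is only a starting point.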
I am not sure whether it is a good idea, but you may want to use a hash object. The object is loaded into RAM. Assuming that you have 50 million numeric observations, you will need around (2 * 8 bytes) * 50 million = 800 MB of RAM, if I am not mistaken.
The code could look like this (using Foxer's macro to loop over the variables, a little helper macro to get the list of variables from a dataset, and a small test dataset with two variables):
%Macro GetVars(Dset) ;
%Local VarList ;
/* open dataset */
%Let FID = %SysFunc(Open(&Dset)) ;
/* If accessible, process contents of dataset */
%If &FID %Then %Do ;
%Do I=1 %To %SysFunc(ATTRN(&FID,NVARS)) ;
%Let VarList= &VarList %SysFunc(VarName(&FID,&I));
%End ;
/* close dataset when complete */
%Let FID = %SysFunc(Close(&FID)) ;
%End ;
&VarList
%Mend ;
data dsn;
input var1 var2;
datalines;
1 48
1 8
2 5
2 965
3 105
4 105
3 85
;
run;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET var = %SCAN(&VARS,&I.);
data out&i.(keep=rank&i.);
if 0 then set dsn;
if _N_ =1 then
do;
dcl hash hh(ordered:'A');
dcl hiter hi('hh');
hh.definekey("&var.");
hh.definedata("&var.","rank&i.");
hh.definedone();
end;
/*Get unique combination variable and point in dataset*/
do while(not last);
set dsn end=last;
hh.ref();
end;
/*Assign ranks within hash object*/
rc=hi.first();
k = 1;
do while(rc=0);
rank&i.=k;
hh.replace();
k+1;
rc=hi.next();
end;
/*Output rank to new dataset in original order of observations*/
do while(not theend);
set dsn end=theend;
hh.find();
output;
end;
/*If data can be sorted according to the rank (with no duplicates) use:
hh.output(dataset: "out&i.");
out&i. will then have variables &var. and rank&i.
However, the merging below may not be sensible anymore
as correspondence between variables is not preserved.
There will also be no duplicates in the dataset.
*/
/*All rows were read in the loops above, so end the step explicitly*/
stop;
run;
%END;
%MEND LOOP_OVER;
%LOOP_OVER(%GetVars(dsn));
/*Merge all rank datasets to one large*/
data all;
merge out:;
run;

SAS proc surveyselect don't sort output

I'm using a PROC SURVEYSELECT statement to get random numbers from a set of integers. SAS then returns the sampled integers but in ascending order, and I need them to remain in random order. How would I either randomly mix the output from the SURVEYSELECT statement, or just get the statement not to sort? I can't seem to find any option that lets the statement just output in the order that it randomly selects.
Here's the code:
proc surveyselect data=data noprint
method=srs
n=numOfSamps
seed=123
out=outputSet;
run;
As always, thanks in advance!
If you just want to randomly sort your final stratified sample, you can use ranuni() and proc sort.
data data;
set data;
rn = ranuni(12345);
run;
proc sort data = data; by rn; run;
To generate a random permutation of n integers drawn from x, you can use PROC PLAN. I don't know how that would fit with your SRS, but then you don't tell the whole story, do you?
proc plan;
factors x=10 of 100 random / noprint;
output out=x10;
run;
quit;
This is the default of SURVEYSELECT, if you have just a single SRS.
data have;
call streaminit(7);
do _order = 1 to 1e6;
intnum = floor(rand('Uniform')*1e9);
output;
end;
run;
proc surveyselect data=have noprint
method=srs
n=10000
seed=123
out=outputSet;
run;
As you can see from the included _ORDER variable, it's not sorted in any way - it still retains the initial ordering (and there's nothing special about that variable).
Now, if you are doing stratified SRS, which is implied by your use of a dataset name in the above code (but you leave the details of that out), it will get sorted, and you need to include the _ORDER variable or something similar, and then re-sort by that variable to return to your original order.
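For the stratified case, the re-sorting step might be sketched as follows. The variable stratum, the dataset names, and the per-stratum sample size are hypothetical, chosen only to make the example self-contained; _order plays the same role as in the program above.

```sas
/* Illustrative data: _order records the original position, stratum is made up */
data have;
  call streaminit(7);
  do _order = 1 to 1000;
    stratum = mod(_order, 3);
    intnum  = floor(rand('Uniform') * 1e9);
    output;
  end;
run;

/* A STRATA statement needs the input sorted by the strata variable */
proc sort data=have out=have_sorted;
  by stratum;
run;

proc surveyselect data=have_sorted noprint
    method=srs
    n=5                 /* 5 rows per stratum */
    seed=123
    out=sample;
  strata stratum;
run;

/* Return the sample to the original row order */
proc sort data=sample;
  by _order;
run;
```

If a random order is wanted instead of the original one, a random key (for example rand('uniform') in a DATA step) can be added to sample and used as the BY variable in the final sort.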

Best way to sort a SystemVerilog associative array?

I have an associative array and I need to process the items in that array in a certain order. What's the best way to do that?
Here is an example. Suppose I have an associative array of people's names and their ages:
int age[string];
age["bob"] = 32;
age["timmy"] = 4;
age["tyrian"] = 31;
I need to process this array from youngest person to oldest. Currently, I'm creating another array for indexing and sorting that.
string sorted_age[$];
// Is there a more efficient way to do this sort?
foreach (age[i]) begin
bit inserted = 0;
foreach (sorted_age[j]) begin
if (age[i] < age[sorted_age[j]]) begin
sorted_age.insert(j, i);
inserted = 1;
break;
end
end
if (!inserted) begin
sorted_age.push_back(i);
end
end
Full example on EDA Playground: http://www.edaplayground.com/x/2_8
You could add one more queue then use the built in array methods. I haven't done a performance test (which could be simulator dependent), but it is fewer lines of code and easy to read.
string sorted_age[$];
int store_age [$];
store_age = age.unique(); // find all unique ages (no duplicates)
store_age.sort(); // sort by age
foreach(store_age[i]) begin
// multi entry push_back
sorted_age = {sorted_age, age.find_index with (item==store_age[i])};
end
Full example on EDA Playground
You can use sort() with a with clause:
// Create an array of people's names
string sorted_age[] = new [age.size()];
int index = 0;
foreach (age[i]) begin
sorted_age[index++] = i;
end
// Sort the array
sorted_age.sort() with (age[item]);
Full example on EDA Playground: http://www.edaplayground.com/x/3Mf

SAS: Improving the speed of a do loop with proc import

I have over 3400 CSV files, with sizes varying between 10 KB and 3 MB. Each CSV file has this generic filename: stockticker-Ret.csv, where stockticker is the stock ticker (AAPL, GOOG, YHOO, and so on), and it contains the stock returns for every minute on a given day. My SAS code starts by loading all the stock ticker names from the stockticker-Ret.csv filenames into a SAS dataset. I then loop over each ticker to load the appropriate .csv file into a SAS dataset called want, apply some data steps to want, and append the final want dataset of each ticker to a SAS dataset called global. As you can imagine, this process takes a long time. Is there a way to improve my DO LOOP code below to make this process go faster?
/*Record in a sas dataset all the csv file name to extract the stock ticker*/
data yfiles;
keep filename;
length fref $8 filename $80;
rc = filename(fref, 'F:\data\');
if rc = 0 then do;
did = dopen(fref);
rc = filename(fref);
end;
else do;
length msg $200.;
msg = sysmsg();
put msg=;
did = .;
end;
if did <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
dnum = dnum(did);
do i = 1 to dnum;
filename = dread(did, i);
/* If this entry is a file, then output. */
fid = mopen(did, filename);
if fid > 0 then output;
end;
rc = dclose(did);
run;
/*store in yfiles all the stock tickers*/
data yfiles(drop=filename rename=(filename1=stock));
set yfiles;
filename1=tranwrd(filename,'-Ret.csv','');
run;
proc sql noprint;
select stock into :name separated by '*' from work.yfiles;
%let count2 = &sqlobs;
quit;
*Create the template of the desired GLOBAL SAS dataset;
proc sql;
create table global
(stock char(8), time_gap num(5), avg_ret num(5));
quit;
proc sql;
insert into global
(stock, time_gap,avg_ret)
values('',0,0);
quit;
%macro y1;
%do i = 1 %to &count2;
%let j = %scan(&name,&i,*);
proc import out = want datafile="F:\data\&j-Ret.csv"
dbms=csv replace;
getnames = yes;
run;
data want;
set want; ....
....[Here I do 5 Datasteps on the WANT sasfile]
/*Store the want file in a global SAS dataset that will contain all the stock tickers from the want file*/
data global;
set global want; run;
%end;
%mend y1;
%y1()
As you can see the global SAS dataset expands for every want dataset that I store in global.
Assuming the files have a common layout, you should not import them with PROC IMPORT or do loops. You should read them all in with one data step. For example:
data want;
length the_file $500;
infile "f:\data\*.csv" dlm=',' lrecl=32767 dsd truncover firstobs=2 filename=the_file;
input
myvar1 myvar2 myvar3 myvar4;
stock_ticker=scan(the_file,-1,'\'); *or whatever gets you the ticker name;
run;
Now, if they don't have identical layouts, or there is some complexity to the readin, you may need a more complex input statement than that, but almost always you can achieve it this way. Do loops with lots of PROC IMPORTs will always be inefficient because of the overhead of the IMPORT.
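One detail worth checking with a wildcard INFILE is the header row: FIRSTOBS=2 applies to the concatenated input as a whole, so it skips only the first file's header. A possible way around that, sketched below, is to watch the FILENAME= variable and discard the first record of each new file. The names prev_file and stock_ticker and the ticker-parsing rule are assumptions for illustration; myvar1 to myvar4 are the placeholder columns from the example above.

```sas
data want;
  length the_file prev_file $500 stock_ticker $8;
  retain prev_file;
  infile "f:\data\*.csv" dlm=',' dsd truncover lrecl=32767 filename=the_file;
  input @;                            /* load the next record and hold it */
  if the_file ne prev_file then do;   /* first record of a new file is its header */
    prev_file = the_file;
    delete;                           /* skip it and move on to the next record */
  end;
  input myvar1 myvar2 myvar3 myvar4;  /* read the held data record */
  /* Assumes names like AAPL-Ret.csv and tickers that contain no hyphen */
  stock_ticker = scan(scan(the_file, -1, '\'), 1, '-');
  drop prev_file;
run;
```

The FILENAME= variable itself is not written to the output dataset, which is why the ticker is copied into a separate variable.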
If you don't want every .csv file in the folder (and can't write a mask for what you do want), or if you have a subset of layouts, you can use the FILEVAR option to read the files in from a common dataset. You could then branch into various input statements, perhaps, if needed.
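data want;
set yfiles;
infile a filevar=filename end=eof; *filename must hold the full path of each file;
do while (not eof); *read every record of the current file;
if filename [some rule] then do;
input ... ;
end;
else if ... then do;
input ... ;
end;
output;
end;
run;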

Do-loop in SAS-IML

I want to use a macro do loop inside proc iml like so:
%Let Tab1=FirstTable;
%Let Tab2=SecondTable;
%Let Tab3=ThirdTable;
*&Tab1-3 have been initialised as sas datasets;
proc iml;
* This works;
use &Tab1;
read all into Mat3;
print Mat3;
* This doesn't work;
%Macro Define_mx;
%do i=1 %to 2;
use &Tab&i;
read all into Mat&i ;
%end;
%Mend Define_mx;
%Define_mx;
*The two matrixes have not been initialised;
print Mat1;
print Mat2;
quit;
In reality I will have to initialise like 50 matrixes so a do-loop is necessary.
I can't figure out why the loop can't see &Tab&i as a macro variable.
I also tried a workaround with a normal (non-macro) do-loop using substr to concatenate the variable names, but it didn't work either. What am I missing here?
Ok, so the macro should be:
%Macro Define_mx;
%do i=1 %to 2;
use &&Tab&i;
read all into Mat&i ;
%end;
%Mend Define_mx;
%Define_mx;
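The second ampersand on Tab is necessary because without it the macro processor would try to resolve &Tab as a macro variable (which does not exist). With &&Tab&i, the first scan turns && into & and resolves &i, giving &Tab1, which a second scan then resolves to the table name. Thus, when concatenating macro variable references to build the name of another macro variable, use &&.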
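The resolution can be checked outside PROC IML with a small %PUT loop. This is only an illustrative sketch; the macro name show_tabs is made up, and the two %LET values are the ones from the question.

```sas
%let Tab1 = FirstTable;
%let Tab2 = SecondTable;

%macro show_tabs;
  %local i;
  %do i = 1 %to 2;
    /* Scan 1: && becomes &, and &i becomes 1 or 2. Scan 2: &Tab1 or &Tab2 is resolved. */
    %put NOTE: Tab&i resolves to &&Tab&i;
  %end;
%mend show_tabs;

%show_tabs
```

The log should show FirstTable for Tab1 and SecondTable for Tab2. Replacing &&Tab&i with &Tab&i reproduces the original problem, because the macro processor then looks for a macro variable named Tab.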
If you have SAS/IML 12.1 (released with 9.3m2), there is an even simpler way.
The USE statement supports dereferencing data set names, like this:
ds = "MyData";
use (ds);
Furthermore, as I show in my article on the VALSET routine, the SAS/IML language can dynamically create matrices named Mat1, Mat2, and so on by calling VALSET.
You can combine these features to eliminate the macros entirely:
data a b c; /* create sample data sets */
x=1;y=2; output;
x=2;y=3; output;
run;
proc iml;
dsnames = {a b c}; /* names of data sets */
do i = 1 to ncol(dsnames);
use (dsnames[i]); /* open each data set */
read all into X;
close (dsnames[i]); /* close the data set that was just read */
MatName = "Mat"+strip(char(i)); /* create Mat1, Mat2,... */
call valset(MatName, X); /* assign values from data set */
end;
show names;
quit;

Resources