SAS proc surveyselect doesn't sort output

I'm using a PROC SURVEYSELECT statement to get random numbers from a set of integers. SAS returns the sampled integers, but in ascending order, and I need them to remain in random order. How can I either randomly shuffle the output from the SURVEYSELECT statement, or just get the statement not to sort at all? I can't find any option that makes the statement output in the order it randomly selects.
Here's the code:
proc surveyselect data=data noprint
   method=srs
   n=numOfSamps
   seed=123
   out=outputSet;
run;
As always, thanks in advance!

If you just want to randomly sort your final stratified sample, you can use ranuni() and PROC SORT.
data data;
   set data;
   rn = ranuni(12345);   /* attach a uniform random number to each row */
run;

proc sort data=data;
   by rn;                /* sorting by it shuffles the rows into random order */
run;

To generate a random permutation of n integers drawn from 1 to x, you can use PROC PLAN. I don't know how that would fit with your SRS, but then you don't tell the whole story, do you?
proc plan;
   factors x=10 of 100 random / noprint;   /* 10 integers from 1..100, in random order */
   output out=x10;
run;
quit;

This is the default behavior of SURVEYSELECT if you have just a single (unstratified) SRS.
data have;
   call streaminit(7);
   do _order = 1 to 1e6;   /* _order records each row's original position */
      intnum = floor(rand('Uniform')*1e9);
      output;
   end;
run;
proc surveyselect data=have noprint
   method=srs
   n=10000
   seed=123
   out=outputSet;
run;
As you can see from the included _ORDER variable, it's not sorted in any way - it still retains the initial ordering (and there's nothing special about that variable).
Now, if you are doing stratified SRS, which is implied by your use of a dataset name for n= in the above code (though you leave those details out), the output will get sorted by the strata, so you need to include the _ORDER variable (or something similar) and re-sort by it afterwards to return to your original order.
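For example, a minimal sketch of that round trip; the stratification variable stratum and the per-stratum sample size are made up for illustration:
data have;
   set have;
   _order = _n_;   /* remember the original row order */
run;

proc sort data=have;
   by stratum;     /* SURVEYSELECT requires the strata to be grouped */
run;

proc surveyselect data=have noprint
   method=srs
   n=10
   seed=123
   out=outputSet;
   strata stratum;
run;

proc sort data=outputSet;
   by _order;      /* restore the original (random) order */
run;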

Related

SAS: What's the optimal way to find the sum of a column by another column?

I want to find out the best way to perform a group-by in SAS so I can run some benchmarks. The two simplest ways I can think of are PROC SQL and PROC MEANS. Here is the example in PROC SQL:
proc sql noprint; /* took 6 mins */
   create table summ as
   select
      id,
      sum(val) as val_sum
   from randint
   group by id;
quit;
I can think of a couple of ways to make this run faster:
use the SASFILE statement to load the data into memory first
create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to PROC SQL or PROC MEANS, so if there are faster ways I would love to know about them!
My setup code is as below:
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
function RandBetween(min, max);
return (min + floor((1 + max - min) * rand("uniform")));
endsub;
run;
options cmplib=work.myfunc;
data RandInt;
do i = 1 to 250000000;
id = RandBetween(1, 2500000);
val = rand("uniform");
output;
end;
drop i;
run;
My SAS comparison macros are as below:
%macro sasbench(dosql = N); %macro _; %mend;
%if &dosql. = Y %then %do;
proc sql noprint; /* took 6 mins */
   create table summ as
   select
      id,
      sum(val) as val_sum
   from randint
   group by id;
quit;
%end;
proc means data=randint sum noprint;
var val ;
class id;
output out = summmeans(drop=_type_ _freq_) sum = /autoname;
run;
%mend;
%sasbench();
/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/
proc datasets lib=work;
   modify randint;
   index create id / nomiss;
run;
quit;
%sasbench();
SASFILE is only a benefit if the entire data set can fit within the session's RAM limits and if the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs or different techniques against the same SASFILE.
An index on id would help if the data were unsorted by id. When the data set is presorted by id, the id column metadata will have the sortedby flag set, which a procedure can use for its own internal optimization; however, there is no guarantee a procedure will use it. As for indexes, use option msglevel=i to get informational messages in the log about index selection during processing.
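For example, a minimal illustration (the query is the one from the question; with MSGLEVEL=I the log gains informational notes about index usage, merge processing, and sorts):
options msglevel=i;   /* log now includes INFO notes, e.g. about index selection */

proc sql noprint;
   create table summ as
   select id, sum(val) as val_sum
   from randint
   group by id;
quit;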
The fastest way is direct addressing, but it requires enough RAM to index a temporary array by the largest id value:
array ids(2500000) _temporary_;   /* sized to the largest possible id */
ids(id) + val;                    /* sum statement: accumulate per id */
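A minimal runnable sketch of that idea, sized to the setup code above (ids run from 1 to 2,500,000, so the array takes about 20 MB); the output dataset name sums is made up:
data sums(keep=id val_sum);
   array ids(2500000) _temporary_;
   /* pass 1: accumulate every val into its id's slot */
   do until (done);
      set randint end=done;
      ids(id) + val;
   end;
   /* pass 2: write one row per id that actually occurred */
   do id = 1 to dim(ids);
      if not missing(ids(id)) then do;
         val_sum = ids(id);
         output;
      end;
   end;
   stop;
run;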
The next fastest way is probably hand-coded array-based hashing; search the SAS conference proceedings for papers by Paul Dorfman.
The next fastest hash way is probably the hash component object with key summing (suminc).
The following DATA step demonstrates the hash object with suminc:
data demo_data;
   do rownum = 1 to 1000;
      id = ceil(100*ranuni(123));      * NOTE: 100 different groups, disordered;
      value = ceil(1000*ranuni(123));  * NOTE: value to be summed over group, integers 1..1000 for demonstration;
      output;
   end;
run;
data _null_;
   if 0 then set demo_data(keep=id value);  %* prep PDV ;
   length total 8;                          %* prep keysum variable ;
   call missing (total);                    %* prevent uninitialized-variable notes ;
   declare hash ids (ordered:'a', suminc:'value', keysum:'total'); %* ordered ensures keys are output in ascending order ;
   ids.defineKey('id');
   *ids.defineData('id'); %* omitting defineData implicitly makes the keys the data, and only data + keysum variables are written by .output ;
   ids.defineDone();
   * read all records and touch each hash key in order to perform the tacit total+value summation;
   do until (end);
      set demo_data end=end;
      if ids.find() ne 0 then ids.add();
   end;
   ids.output(dataset:'sum_value_over_id'); * save the summation for each key combination;
   stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
However, if value is known to always be a natural number, and the group size is known to be less than 10**k for some k, you could numerically encode the count by using a suminc variable equal to value + 10**-k, and then decode the count by processing the output data with count = (total - int(total)) * 10**k.
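A hedged sketch of that encoding against the demo_data above, with k=6 (so it assumes every group has fewer than 10**6 rows; the dataset names encoded and sum_count_over_id are made up):
data _null_;
   if 0 then set demo_data(keep=id value);
   length w total 8;
   call missing(w, total);
   declare hash ids (ordered:'a', suminc:'w', keysum:'total');
   ids.defineKey('id');
   ids.defineDone();
   do until (end);
      set demo_data end=end;
      w = value + 1e-6;   /* integer part sums value, fraction counts rows */
      if ids.find() ne 0 then ids.add();
   end;
   ids.output(dataset:'encoded');
   stop;
run;

data sum_count_over_id;
   set encoded;
   count = round((total - int(total)) * 1e6);   /* decode the count */
   total = int(total);                          /* decode the sum   */
run;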
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
   by id;
run;

data sum_value_over_id_v2(keep=id total);
   /* DOW loop: each implicit data step iteration consumes one id group,
      then performs one implicit output at the bottom of the step */
   do until (last.id);
      set foo;
      by id;
      total = sum(total, value);
   end;
run;
You will likely find that I/O is largest component of performance.
The best answer varies dramatically by the application. In your example, PROC SQL (at least on my machine) significantly outperforms PROC MEANS, but there are plenty of cases where it will not. It's able to here because, more than likely, it's building hash tables behind the scenes, which are quite fast: a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would have to have it in memory to begin with, though, more than likely; just reading it into memory for this purpose alone wouldn't really help, since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest, or close to it, in simple cases, both because it's multithreaded (as opposed to the data step, which is single-threaded) and because it has a fast hash table backend.
PROC MEANS is also usually going to be competitive; the case you show in the example is almost a worst case for it, since it has a huge number of class variable levels, so it may well be creating a temporary table on disk. It is also multithreaded. Reduce the class variable to 2,500 levels instead of 2,500,000 and PROC MEANS comes out a bit faster than PROC SQL (but within the margin of error).
Data step accumulation, whether in a hash table or a DoW loop, will sometimes outperform both of the above and sometimes not, again depending on the data; here it is slightly faster. The code for data step accumulation tends to be more complex, which is why I'd usually discourage it unless the savings are substantial: more code to maintain is typically worse. PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth going this route, especially if the accumulation can ride along in a data step you already need. Of course, the hash table method is limited to results that fit in memory, though that is usually manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance, and, when possible, to be consistent with your other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed unless you're doing more complicated things than you present above; summation is actually more work than the I/O in many cases. Don't overcomplicate it: programmer hours and difficulty of QA should trump raw performance unless you're talking about several hours' difference. And if you are, then just run tests on your actual use case and see what works best.
If you can assume the data is sorted by id, then this is another solution:
data sum_value_over_id_v2(keep=id total);
   set a.randint(keep=id val);
   by id;
   total + val;              /* sum statement: retained across rows */
   if last.id then do;
      output;                /* one row per id */
      total = 0;             /* reset for the next group */
   end;
run;

How to rank multiple variables in a large data set?

I have a data set of around 50 million records with around 30 variables (columns).
I need to rank the dataset on each variable.
PROC RANK does not work, since it requires a lot of memory for a dataset this large.
To assign ranks manually, I would have to sort the dataset on the respective variable and then assign ranks with a formula. The problem is that we would have to sort the dataset 30 times, once per variable, which would take a very long time and is not feasible.
What alternatives can we use in this case?
You're in a tough spot without many options. If you're sorting and keeping all 30 variables each time, that will significantly increase your processing times. If I were you, I'd keep only the variable you want to rank and a sequence number, apply your formula, then merge it all back together at the end. This requires you to loop over each variable in your dataset and then merge everything back together. See the example below and check whether it decreases your processing time:
** PUT ALL VARIABLES INTO LIST **;
PROC SQL NOPRINT;
SELECT DISTINCT(NAME)
INTO :VARS SEPARATED BY " "
FROM DICTIONARY.COLUMNS
WHERE LIBNAME = 'SASHELP' AND MEMNAME = 'FISH';
QUIT;
%PUT &VARS.;
** CREATE SEQUENCE NUMBER IN FULL DATA **;
DATA FISH; SET SASHELP.FISH;
SEQ=_N_;
RUN;
** LOOP OVER EACH VARIABLE TO ONLY PROCESS THAT VARIABLE AND SEQUENCE -- REDUCES PROCESSING TIME **;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET VAR = %SCAN(&VARS,&I.);
DATA FISH_&I.; SET FISH (KEEP=SEQ &VAR.);
RUN;
/* INSERT YOUR FORMULA CODE HERE ON FISH_&I. DATA (MINE IS AN EXAMPLE) */
PROC SORT DATA = FISH_&I.;
BY &VAR.;
RUN;
DATA FISH1_&I.; SET FISH_&I.;
BY &VAR.;
RANK_&VAR = _N_;
RUN;
/* RESORT FINAL DATA BY SEQUENCE NUMBER VARIABLE */
PROC SORT DATA = FISH1_&I.;
BY SEQ;
RUN;
%END;
%MEND;
%LOOP_OVER(&VARS.);
** MERGE ALL SUBSETS BACK TOGETHER BY THE ORIGINAL SEQUENCE NUMBER **;
DATA FINAL;
MERGE FISH1_:;
BY SEQ;
DROP SEQ;
RUN;
If you just need to rank into deciles/percentiles etc., rather than producing a complete ranking from 1 to 50M across all 50M rows, you should be able to get a very good approximation of the correct answer using a much smaller amount of memory via PROC SUMMARY, using qmethod=P2 and specifying a suitable qmarkers setting.
This approach uses the P-squared algorithm:
http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
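Under those constraints, a sketch along these lines should work; the dataset name and the variable list var1-var30 are assumptions, and QMARKERS must be an odd integer (more markers give a better approximation):
proc summary data=bigdata qmethod=p2 qmarkers=201;
   var var1-var30;
   output out=approx_deciles
      p10= p20= p30= p40= p50= p60= p70= p80= p90= / autoname;
run;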
I am not sure whether it is a good idea, but you may want to use a hash object. The object is loaded into your RAM. Assuming that you have 50 million numerical observations, you will need around (2 * 8 bytes) * 50 million = 800 MB of RAM, if I am not mistaken.
The code could look like this (using Foxer's macro to loop over the variables, a little helper macro to get the list of variables from a dataset, and a small test dataset with two variables):
%Macro GetVars(Dset) ;
%Local VarList ;
/* open dataset */
%Let FID = %SysFunc(Open(&Dset)) ;
/* If accessible, process contents of dataset */
%If &FID %Then %Do ;
%Do I=1 %To %SysFunc(ATTRN(&FID,NVARS)) ;
%Let VarList= &VarList %SysFunc(VarName(&FID,&I));
%End ;
/* close dataset when complete */
%Let FID = %SysFunc(Close(&FID)) ;
%End ;
&VarList
%Mend ;
data dsn;
input var1 var2;
datalines;
1 48
1 8
2 5
2 965
3 105
4 105
3 85
;
run;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET var = %SCAN(&VARS,&I.);
data out&i.(keep=rank&i.);
if 0 then set dsn;
if _N_ =1 then
do;
dcl hash hh(ordered:'A');
dcl hiter hi('hh');
hh.definekey("&var.");
hh.definedata("&var.","rank&i.");
hh.definedone();
end;
/* Load every observation; the hash keeps one entry per unique value of the variable */
do while(not last);
set dsn end=last;
hh.ref();
end;
/*Assign ranks within hash object*/
rc=hi.first();
k = 1;
do while(rc=0);
rank&i.=k;
hh.replace();
k+1;
rc=hi.next();
end;
/*Output rank to new dataset in original order of observations*/
do while(not theend);
set dsn end=theend;
hh.find();
output;
end;
/* If the data can be sorted according to the rank (with no duplicates), use:
hh.output("out&i.");
out&i. will then have variables &var. and rank&i.
However, the merging below may not be sensible anymore,
as the correspondence between variables is not preserved.
There will also be no duplicates in the dataset.
*/
run;
%END;
%MEND LOOP_OVER;
%LOOP_OVER(%GetVars(dsn));
/*Merge all rank datasets to one large*/
data all;
merge out:;
run;

SAS - Proc Means and Granularity of Search

I am new to SAS and would like my searches to be more granular.
In this example, I would like my output to show statistics for the variable SalePrice, but only for observations with CentralAir = 'Y' (another variable). I would also like to show statistics for the OverallQual variable, but only where OverallQual is over 7 and the observation has '1FAM' as its BldgType (which is another variable).
I understand my syntax is incorrect. Any guidance would be appreciated. Thank you!
proc means data=MYDATA.AMES_HOUSING_DATA n nmiss p1 p10 q1 mean q3 p90 stddev median;
var SalePrice if (CentralAir = 'Y');
var OverallQual if (OverallQual GT 7 AND BldgType = '1FAM');
run;
Use a WHERE statement (or the WHERE= dataset option) to limit the records that the procedure uses. You can only use one WHERE clause per procedure, though, so you would need to run it twice to select the two different sets of records. You might also want to use PROC UNIVARIATE to get a fuller summary of each variable's distribution.
proc univariate data=MYDATA.AMES_HOUSING_DATA ;
where CentralAir = 'Y';
var SalePrice;
run;
proc univariate data=MYDATA.AMES_HOUSING_DATA ;
where OverallQual GT 7 AND BldgType = '1FAM' ;
var OverallQual ;
run;

correlation matrix and statistics for all numerical columns in SAS

I have a data set with the name final_data which has numeric fields and some string fields. What I want to do is this:
Print out a correlation matrix between all numeric variables in the data set, and compute the mean, min, max, and number of missing values for all the numeric variables in the data.
Now, I know how to calculate the mean, min, and max by specifying the variables explicitly, but I have no clue how to do it for all numeric variables at once. I also don't know how to calculate the number of missing values. As for the correlation matrix between all numeric fields, I have no clue how to do that.
P.S. For column names you may use num1, num2, str1, str2, and so on for numeric and string columns respectively.
Statistical procedures usually act on all numeric variables by default, so you actually don't need to specify them, e.g.:
proc corr data=sashelp.prdsale;
run;
proc means data=sashelp.prdsale mean min max nmiss;
run;
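Equivalently, you can name them explicitly with the special variable list _numeric_ (shown here against the final_data name from the question):
proc corr data=final_data;
   var _numeric_;   /* all numeric variables */
run;

proc means data=final_data mean min max nmiss;
   var _numeric_;
run;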

How to remove duplicated records/observations WITHOUT sorting in SAS?

I wonder if there is a way to deduplicate records WITHOUT sorting. Sometimes I want to keep the original order and just remove duplicated records.
Is it possible?
BTW, below are the approaches I know for deduplicating records, both of which involve sorting in the end:
1.
proc sql;
   create table yourdata_nodupe as
   select distinct *
   from abc;
quit;
2.
proc sort data=YOURDATA nodupkey;
by var1 var2 var3 var4 var5;
run;
You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.
Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.
data nodupes (drop=rc);
   length Make $13;
   declare hash found_keys();           /* tracks the key values seen so far */
   found_keys.definekey('Make');
   found_keys.definedone();
   do while (not done);
      set sashelp.cars end=done;
      rc = found_keys.check();
      if rc ^= 0 then do;               /* first time this key appears */
         rc = found_keys.add();
         output;
      end;
   end;
   stop;
run;

proc print data=nodupes;
run;
/* Give each record in the original dataset and row number */
data with_id ;
set mydata ;
_id = _n_ ;
run ;
/* Remove dupes */
proc sort data=with_id nodupkey ;
by var1 var2 var3 ;
run ;
/* Sort back into original order */
proc sort data=with_id ;
by _id ;
run ;
I think the short answer is no, there isn't, at least not a way that wouldn't have a much bigger performance hit than a method based on sorting.
There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?) but this wouldn't help you with a general method.
Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.
Depending on the number of variables in your data set, the following might be practical; note that it only removes duplicates that are adjacent to each other:
data abc_nodup;
   set abc;
   retain _var1 _var2 _var3 _var4;   /* previous record's values */
   if _n_ eq 1 then output;
   else do;
      if (var1 eq _var1) and (var2 eq _var2) and
         (var3 eq _var3) and (var4 eq _var4)
      then delete;                   /* identical to the previous record */
      else output;
   end;
   _var1 = var1;
   _var2 = var2;
   _var3 = var3;
   _var4 = var4;
   drop _var:;
run;
Please refer to SAS Usage Note 37581, "How can I eliminate duplicate observations from a large data set without sorting?", http://support.sas.com/kb/37/581.html . It shows how PROC SUMMARY can be used to remove duplicates more efficiently, without sorting.
The two examples given in the original post are not identical.
distinct in PROC SQL only removes rows that are fully identical.
nodupkey in PROC SORT removes any row whose key variables match a previous row's (even if the other variables differ). You need the option noduprecs to remove fully identical rows.
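A small illustration of the difference; have, id, and x are hypothetical names, and note that noduprecs only compares adjacent rows, so fully identical rows must end up next to each other in the sort order to be removed:
/* keeps one row per id, discarding later rows even when x differs */
proc sort data=have out=dedup_bykey nodupkey;
   by id;
run;

/* removes a row only when it is identical to the adjacent previous row */
proc sort data=have out=dedup_full noduprecs;
   by id;
run;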
If you are only looking for records with common key variables, another solution I can think of would be to create a dataset with only the key variable(s), find out which ones are duplicated, and then apply a format to the original data to flag the duplicate records. If more than one key variable is present, you would need to create a new variable containing the concatenation of all the key variable values, converted to character if needed.
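A hedged sketch of that flagging approach for a single numeric key; all dataset, variable, and format names here are made up:
proc sql;
   create table dupkeys as
   select key, 'DUP' as label
   from have
   group by key
   having count(*) > 1;   /* keys that occur more than once */
quit;

data fmt;
   set dupkeys(rename=(key=start));
   retain fmtname 'dupflag' type 'N';   /* type 'N': a numeric key */
run;

proc format cntlin=fmt;   /* build the format from the dataset */
run;

data flagged;
   set have;              /* original order is untouched */
   is_dup = (put(key, dupflag.) = 'DUP');   /* unmatched keys fall back to default formatting */
run;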
This is the fastest way I can think of; it requires no sorting because it relies on the data already being in order:
data output_data_name;
   set input_data_name (
      sortedby = person_id stay   /* asserts the existing order; no sort is performed */
      keep =
         person_id
         stay
         ... more variables ...);
   by person_id stay;
   if first.stay then output;     /* keep only the first record of each group */
run;
data output;
   set yourdata;
   by var notsorted;           /* notsorted: groups are runs of consecutive identical values */
   if first.var then output;
run;
This will not sort the data, but it will remove duplicates within each run of consecutive identical values of var.
