How to rank multiple variables in a large data set?

I have a data set of around 50 million records with around 30 variables (columns).
I need to rank the dataset for each variable.
Proc rank does not work since it requires a lot of memory for this large dataset.
To assign ranks manually, I have to sort the dataset on the respective variable column and then assign ranks using a formula. But the problem is that this means sorting the dataset 30 times, once per variable, which would take a very long time and is not feasible.
What alternatives can we use in this case?

You're in a tough spot without many options. If you're sorting and keeping all 30 variables each time, that will significantly increase your processing times. If I were you, I'd keep only the variable you want to rank plus a sequence number, apply your formula, and then merge everything back together at the end. This requires looping over each variable in your dataset and then merging the results back together. See the example below and check whether it helps decrease your processing times:
** PUT ALL VARIABLES INTO LIST **;
PROC SQL NOPRINT;
    SELECT DISTINCT(NAME)
    INTO :VARS SEPARATED BY " "
    FROM DICTIONARY.COLUMNS
    WHERE LIBNAME = 'SASHELP' AND MEMNAME = 'FISH';
QUIT;

%PUT &VARS.;

** CREATE SEQUENCE NUMBER IN FULL DATA **;
DATA FISH;
    SET SASHELP.FISH;
    SEQ = _N_;
RUN;

** LOOP OVER EACH VARIABLE TO ONLY PROCESS THAT VARIABLE AND SEQUENCE -- REDUCES PROCESSING TIME **;
%MACRO LOOP_OVER(VARS);
    %DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
        %LET VAR = %SCAN(&VARS, &I.);

        DATA FISH_&I.;
            SET FISH (KEEP=SEQ &VAR.);
        RUN;

        /* INSERT YOUR FORMULA CODE HERE ON FISH_&I. DATA (MINE IS AN EXAMPLE) */
        PROC SORT DATA=FISH_&I.;
            BY &VAR.;
        RUN;

        DATA FISH1_&I.;
            SET FISH_&I.;
            BY &VAR.;
            RANK_&VAR. = _N_;
        RUN;

        /* RESORT FINAL DATA BY SEQUENCE NUMBER VARIABLE */
        PROC SORT DATA=FISH1_&I.;
            BY SEQ;
        RUN;
    %END;
%MEND;

%LOOP_OVER(&VARS.);

** MERGE ALL SUBSETS BACK TOGETHER BY THE ORIGINAL SEQUENCE NUMBER **;
DATA FINAL;
    MERGE FISH1_:;
    BY SEQ;
    DROP SEQ;
RUN;

If you just need to rank into deciles / percentiles etc rather than a complete ranking from 1 to 50m across all 50m rows, you should be able to get a very good approximation of the correct answer using a much smaller amount of memory via proc summary, using qmethod=P2 and specifying a suitable qmarkers setting.
This approach uses the P-squared algorithm:
http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
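A minimal sketch of that idea (the dataset name BIGDATA and the variables VAR1-VAR30 are placeholders, not from the question):

/* Approximate percentile cut-points for all 30 variables in one pass; QMETHOD=P2
   uses the P-squared estimation method and QMARKERS trades memory for accuracy. */
proc summary data=bigdata qmethod=p2 qmarkers=201;
    var var1-var30;
    output out=pctl_cuts
        p10= p25= p50= p75= p90= / autoname;
run;

The estimated cut-points can then be applied in a single pass over the data to bucket each row, instead of producing an exact 1-to-50-million ranking.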

I am not sure whether it is a good idea, but you may want to use a hash object. The object is loaded into your RAM. Assuming that you have 50 million numerical observations, you will need around (2 * 8 bytes) * 50 million = 800MB of RAM -- if I am not mistaken.
The code could look like this (using Foxer's macro to loop over the variables, a little helper macro to get the list of variables from a dataset, and a small test dataset with two variables):
%Macro GetVars(Dset) ;
%Local VarList ;
/* open dataset */
%Let FID = %SysFunc(Open(&Dset)) ;
/* If accessible, process contents of dataset */
%If &FID %Then %Do ;
%Do I=1 %To %SysFunc(ATTRN(&FID,NVARS)) ;
%Let VarList= &VarList %SysFunc(VarName(&FID,&I));
%End ;
/* close dataset when complete */
%Let FID = %SysFunc(Close(&FID)) ;
%End ;
&VarList
%Mend ;
data dsn;
input var1 var2;
datalines;
1 48
1 8
2 5
2 965
3 105
4 105
3 85
;
run;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
    %LET var = %SCAN(&VARS, &I.);

    data out&i.(keep=rank&i.);
        if 0 then set dsn;
        if _N_ = 1 then do;
            dcl hash hh(ordered:'A');
            dcl hiter hi('hh');
            hh.definekey("&var.");
            hh.definedata("&var.", "rank&i.");
            hh.definedone();
        end;

        /* Load each unique value of the variable into the hash while stepping through the dataset */
        do while(not last);
            set dsn end=last;
            hh.ref();
        end;

        /* Assign ranks within hash object */
        rc = hi.first();
        k = 1;
        do while(rc = 0);
            rank&i. = k;
            hh.replace();
            k + 1;
            rc = hi.next();
        end;

        /* Output rank to new dataset in original order of observations */
        do while(not theend);
            set dsn end=theend;
            hh.find();
            output;
        end;

        stop; /* all work is done in one pass of the implicit loop */

        /* If the data can be sorted according to the rank (with no duplicates) use:
             hh.output("out&i.");
           out&i. will then have variables &var. and rank&i.
           However, the merging below may not be sensible anymore,
           as the correspondence between variables is not preserved.
           There will also be no duplicates in the dataset.
        */
    run;
%END;
%MEND LOOP_OVER;
%LOOP_OVER(%GetVars(dsn));
/*Merge all rank datasets to one large*/
data all;
merge out:;
run;

Related

Proc compare loop through multiple datasets in a folder SAS

I have two folders containing numerous datasets. Each folder contains the same set of datasets, and I'd like to compare them to ensure they are similar. Is it possible to loop through each folder and compare each dataset?
%macro compare(dpath=, cpath=,);
%do i = 1 %to n;
proc compare base = &dpath data = &cpath;
run;
%mend;
%compare(dpath=folder1_path, cpath=folder2_path);
Point librefs to the "folders". Get the lists of datasets. Use the list to drive the code generation.
%macro compare(dpath,cpath);
libname left "&dpath";
libname right "&cpath";
proc contents data=left._all_ noprint out=contents;
run;
data _null_;
set contents;
by memname;
if first.memname;
call execute(catx(' '
,'proc compare base=',cats('left.',memname)
,'compare=',cats('right.',memname)
,';run;'
));
run;
%mend;
%compare
(dpath=folder1_path
,cpath=folder2_path
);
To make it more robust you might want to do things like check that the member names in LEFT actually match the member names in RIGHT. Or add an ID statement to the generated PROC COMPARE code so that PROC COMPARE knows how to match observations, otherwise it will just match the observations in the order they appear.
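As a rough sketch of that membership check (reusing the LEFT and RIGHT librefs from the macro above; the output dataset names here are made up):

proc contents data=left._all_  noprint out=left_contents(keep=memname);  run;
proc contents data=right._all_ noprint out=right_contents(keep=memname); run;
proc sort data=left_contents  nodupkey; by memname; run;
proc sort data=right_contents nodupkey; by memname; run;

/* Keep only member names present in both libraries before generating the PROC COMPARE steps. */
data members_in_both;
    merge left_contents(in=l) right_contents(in=r);
    by memname;
    if l and r;
run;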
This macro will compare the contents of all data exactly as-is and output if there are any differences at all. If there are no differences, the dataset all_differences will not be created.
%macro compare(dpath=, cpath=);
libname d "&dpath";
libname c "&cpath";
/* Save all dataset names to macro variables:
&name1 &name2 etc. */
data _null_;
set sashelp.vmember;
where libname = 'D';
call symputx(cats('name', _N_), memname);
call symputx('n', _N_);
run;
proc datasets lib=work nolist;
delete all_differences;
quit;
%do i = 1 %to &n;
/* Compare the datasets and only output observations that are unequal */
proc compare base = d.&&name&i
compare = c.&&name&i
out = outcomp
outnoequal;
run;
/* Get the number of obs from outcomp */
%let dsid = %sysfunc(open(outcomp));
%let nobs = %sysfunc(attrn(&dsid, nlobs));
%let rc = %sysfunc(close(&dsid));
/* If outcomp is populated, log the dataset with differences */
%if(&nobs > 0) %then %do;
data _difference_;
length dsn $32.;
dsn = "&&name&i.";
run;
proc append base=all_differences
data=_difference_
force;
run;
%end;
%end;
%mend;

SAS: What's the optimal way to find the sum of a column by another column?

I want to find out the best way to perform a group-by in SAS so I can perform some benchmarks. The simplest two ways I can think of are Proc SQL and Proc means. Here is the example in proc sql:
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
I think there are ways to make this run faster:
use the sasfile command to load the data into memory first
create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to proc sql nor proc means, so if there are faster ways then I would love to know about it!!!
My set up code is as below
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
function RandBetween(min, max);
return (min + floor((1 + max - min) * rand("uniform")));
endsub;
run;
options cmplib=work.myfunc;
data RandInt;
do i = 1 to 250000000;
id = RandBetween(1, 2500000);
val = rand("uniform");
output;
end;
drop i;
run;
My SAS comparison macros are as below
%macro sasbench(dosql = N); %macro _; %mend;
%if &dosql. = Y %then %do;
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
%end;
proc means data=randint sum noprint;
var val ;
class id;
output out = summmeans(drop=_type_ _freq_) sum = /autoname;
run;
%mend;
%sasbench();
/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/
proc datasets lib=work;
modify randint;
INDEX CREATE id / nomiss;
run;
%sasbench();
sasfile is only a benefit if the entire data set can fit into session ram limits and if the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs / different techniques on the same sasfile.
An index on id would help if the data is unsorted by id. When the data set is presorted by id, the id column metadata will have the sortedby flag set, which a procedure can use for its own internal optimization; however, there is no guarantee. As for indexes, use option msglevel=i to get informational messages in the log about index selection during processing.
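For instance, a tiny sketch (not part of the original benchmark; SUMM_BY is a placeholder name) that surfaces those index notes while summing by group:

options msglevel=i;   /* log informational notes, including index selection */
proc means data=randint sum noprint;
    by id;            /* BY-group processing on unsorted data can use the index on ID */
    var val;
    output out=summ_by sum= / autoname;
run;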
The fastest way is direct addressing, but it requires enough RAM to handle the largest id value as an array index:
array ids(250000000) _temporary_
ids(id) + value
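Fleshing that out, a rough sketch sized to the largest id value in the question's setup code (2,500,000), using the question's RANDINT dataset and VAL variable (the output dataset name is made up):

data summ_direct(keep=id total);
    /* One _TEMPORARY_ array slot per possible ID value; untouched slots stay missing. */
    array tot(2500000) _temporary_;
    do until (done);
        set randint end=done;
        tot(id) + val;
    end;
    /* Write out only the IDs that actually occurred. */
    do id = 1 to dim(tot);
        if tot(id) ne . then do;
            total = tot(id);
            output;
        end;
    end;
    stop;
run;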
The next fastest way is probably hand coded array based hashing:
search SAS conference proceedings for papers by Paul Dorfman
The next fastest hash way is probably the hash component object with key suminc.
DATA Step was edited to align with the comments
data demo_data;
do rownum = 1 to 1000;
id = ceil(100*ranuni(123)); * NOTE: 100 different groups, disordered;
value = ceil(1000*ranuni(123)); * NOTE: want to sum value over group, for demonstration individual values integers from 1..1000;
output;
end;
run;
data _null_;
if 0 then set demo_data(keep=id value); %* prep pdv ;
length total 8; %* prep keysum variable ;
call missing (total); %* prevent warnings ;
declare hash ids (ordered:'a', suminc:'value', keysum:'total'); %* ordered ensures keys will be sorted ascending upon output ;
ids.defineKey('id');
*ids.defineData('id');  %* not having a defineData is an implicit way of adding only the keys as data, only data + keysum variables are written by .output() ;
ids.defineDone();
* read all records and touch each hash key in order to perform tacit total+value summation;
do until (end);
set demo_data end=end;
if ids.find() ne 0 then ids.add();
end;
ids.output(dataset:'sum_value_over_id'); * save the summation of each key combination;
stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
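A minimal sketch of that counting variation (the helper variable ONE and the output names are hypothetical, not from the original answer):

data _null_;
    if 0 then set demo_data(keep=id);     /* prep pdv */
    retain one 1;                         /* constant used as the SUMINC variable */
    length n_obs 8;                       /* prep keysum variable */
    call missing(n_obs);                  /* prevent warnings */
    declare hash ids (ordered:'a', suminc:'one', keysum:'n_obs');
    ids.defineKey('id');
    ids.defineDone();
    do until (end);
        set demo_data end=end;
        if ids.find() ne 0 then ids.add();   /* each touch adds ONE (=1) to the keysum */
    end;
    ids.output(dataset:'count_over_id');     /* one row per ID with its count in N_OBS */
    stop;
run;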
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
However, if value is known to always be a natural number, and the group size is known to be < 10^k for some chosen limit k, you could numerically encode the count by using a suminc variable of value + 10^-k and numerically decode the count by processing the output data with count = (total - int(total)) * 10^k. (For example, with k = 3 the suminc variable adds value + 0.001 per observation, so int(total) is the sum and the fractional part times 1000 is the count.)
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
by id;
data sum_value_over_id_v2(keep=id total);
do until (last.id);
set foo;
by id;
total = sum(total, value);
end;
run;
You will likely find that I/O is largest component of performance.
The best answer varies dramatically by the application. In your example, PROC SQL at least on my machine significantly outperforms PROC MEANS, but there are plenty of cases where it will not do so. It's able to in this case because it's building hash tables behind the scenes, more than likely, which are quite fast - a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would have to have it in memory to begin with, though, more than likely; just reading it into memory for this purpose alone wouldn't really help since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest or similar to the fastest in simple cases, both because it's multithreaded (as opposed to data step being single threaded) and because it's got a fast hash table backend.
PROC MEANS is also usually going to be competitive; the case you show in the example is almost a worst case for it, since the class variable has a huge number of categories, so I think it may be creating a temporary table on disk. It's also multithreaded. Reduce the class variable categories to 2,500 instead of 2,500,000 and you get PROC MEANS a bit faster than PROC SQL (but within the margin of error).
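For reference, a quick sketch of that variation on the question's setup (same RandBetween function from the question's setup code, just 2,500 id categories; the dataset name is made up):

data RandInt2500;
    /* Same row count as the original benchmark data, but far fewer CLASS levels. */
    do i = 1 to 250000000;
        id = RandBetween(1, 2500);
        val = rand("uniform");
        output;
    end;
    drop i;
run;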
Data step accumulation, either in a hash table or a DoW loop, will sometimes outperform both of the above, and sometimes not, again depending on the data. Here it does outperform slightly. The code for data step accumulation tends to be a bit more complex, which is why I'd usually discourage it unless the savings is substantial (having more code to maintain is worse, typically). PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth it to go this route, especially if the data step is helpful. Of course, the hash table method is limited to fitting the results in memory, though usually that's manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance; and when possible try to be self consistent with other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed, unless you're doing more complicated things than you present above. Summation is actually more work than I/O in many cases. Don't overcomplicate it, ultimately: programmer hours and difficulty of QA is something that should trump basic performance, unless you're talking several hours' difference. And if you are, then just run tests on your actual use case and see what works best.
If you assume the data is sorted then this is another solution
data sum_value_over_id_v2(keep=id total);
set a.randint(keep=id val);
by id;
total + val;
if last.id then do;
output;
total = 0;
end;
drop val;
run;

SAS proc surveyselect don't sort output

I'm using a PROC SURVEYSELECT statement to get random numbers from a set of integers. SAS then returns the sampled integers but in ascending order, and I need them to remain in random order. How would I either randomly mix the output from the SURVEYSELECT statement, or just get the statement not to sort? I can't seem to find any option that lets the statement just output in the order that it randomly selects.
Here's the code:
proc surveyselect data=data noprint
method=srs
n=numOfSamps
seed=123
out=outputSet;
run;
As always, thanks in advance!
If you just want to randomly sort your final stratified sample you can use ranuni() and proc sort.
data data;
set data;
rn = ranuni(12345);
run;
proc sort data = data; by rn; run;
To generate a random permutation of n integers drawn from 1 to x you can use PROC PLAN. I don't know how that would fit with your SRS, but then you don't tell the whole story, do you?
proc plan;
factors x=10 of 100 random / noprint;
output out=x10;
run;
quit;
This is the default of SURVEYSELECT, if you have just a single SRS.
data have;
call streaminit(7);
do _order = 1 to 1e6;
intnum = floor(rand('Uniform')*1e9);
output;
end;
run;
proc surveyselect data=have noprint
method=srs
n=10000
seed=123
out=outputSet;
run;
As you can see from the included _ORDER variable, it's not sorted in any way - it still retains the initial ordering (and there's nothing special about that variable).
Now, if you are doing stratified SRS, which is implied by your use of a dataset name in the above code (but you leave the details of that out), it will get sorted, and you need to include the _ORDER variable or something similar, and then re-sort by that variable to return to your original order.
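For example, a rough sketch of that stratified case (the STRATUM variable below is made up purely for illustration); SURVEYSELECT groups its output by the strata, so re-sorting by _ORDER restores the original order:

data have_str;
    set have;
    stratum = 1 + mod(_order, 4);   /* fake 4-level stratification variable */
run;
proc sort data=have_str;
    by stratum;
run;
proc surveyselect data=have_str noprint
                  method=srs
                  n=2500              /* 2,500 per stratum */
                  seed=123
                  out=outputSet_str;
    strata stratum;
run;
/* Restore the original (random) ordering. */
proc sort data=outputSet_str;
    by _order;
run;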

SAS: Iterate over dataset names

I have a sas program that merges two datasets containing information on a given city. The city name is part of the name of every dataset, e.g. for Atlanta:
data work.AtlantaComplete;
merge dir.Atlanta1 dir.Atlanta2;
by mergevar;
run;
I would like to run the merge on a long list of cities without having to make a separate .sas file for each one. With limited knowledge of SAS I tried the following:
%let city1 = Boston;
...
%let cityN = Miami;
%DO for i = 1 to N;
data work.city&i;
set dir.city&i.1 dir.city&i.2;
by mergevar;
run;
This produced several errors, the main one being that %DO statements must be inside a data step. This won't work for my task because the loop has to be defined before the first data step.
A solution that can be implemented within the sas program would be optimal, but I am also open to something like a Unix Bash shell script that provides each city one at a time as a system input to the sas program (similar to sys.argv in Python).
Thanks for your help
You have several small mistakes in your program.
Executing a %do loop is normally done inside a macro. Also, you don't use the keyword "for", and you need a % in front of the "to".
So try this:
%let city1 = Boston;
...
%let cityN = Miami;
%let N = 42; *or whatever your N is...;
%macro mergecities(citynumber);
%DO i = 1 %to &citynumber;
data work.&&city&i;
set dir.&&city&i..1 dir.&&city&i..2;
by mergevar;
run;
%end;
%mend;
%mergecities(&N);
Instead of using the macro variable citynumber you can use &N directly inside the do loop, but with a parameter the macro is more flexible...
If you have numbered macro variables, you use &&varname&i to resolve them. Also, by having your cities in a dataset, you can create the macro variables off the back of it, rather than hard-coding them all (plus the count).
data cities ;
input City $20. ;
/* Create numbered macro variables based on _n_ */
call symputx(cats('CITY',_n_),City) ;
call symputx('N',_n_) ;
datalines ;
Atlanta
Boston
Chicago
Houston
Texas
;
run ;
%MACRO LOOP ;
%DO I = 1 %TO &N ;
data &&CITY&I..Complete ;
merge dir.&&CITY&I..1
dir.&&CITY&I..2 ;
by mergevar ;
run ;
%END ;
%MEND ;
%LOOP ;

SAS: Improving the speed of a do loop with proc import

I have over 3400 CSV files, with sizes varying between 10kb and 3mb. Each CSV file has this generic filename: stockticker-Ret.csv, where stockticker is a stock ticker like AAPL, GOOG, YHOO, and so on, and contains stock returns for every minute of a given day. My SAS code starts by loading all the stock ticker names from the stockticker-Ret.csv filenames into a SAS dataset. I then loop over each ticker to load the appropriate .csv file into a SAS dataset called want, apply some data steps to want, and append the result for each ticker to a SAS dataset called global. As you can imagine, this process takes a long time. Is there a way to improve my DO LOOP code below to make this process go faster?
/*Record in a sas dataset all the csv file names to extract the stock tickers*/
data yfiles;
    keep filename;
    length fref $8 filename $80;
    rc = filename(fref, 'F:\data\');
    if rc = 0 then do;
        did = dopen(fref);
        rc = filename(fref);
    end;
    else do;
        length msg $200.;
        msg = sysmsg();
        put msg=;
        did = .;
    end;
    if did <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
    dnum = dnum(did);
    do i = 1 to dnum;
        filename = dread(did, i);
        /* If this entry is a file, then output. */
        fid = mopen(did, filename);
        if fid > 0 then output;
    end;
    rc = dclose(did);
run;
/*store in yfiles all the stock tickers*/
data yfiles(drop=filename1 rename=(filename1=stock));
set yfiles;
filename1=tranwrd(filename,'-Ret.csv','');
run;
proc sql noprint;
select stock into :name separated by '*' from work.yfiles;
%let count2 = &sqlobs;
quit;
*Create the template of the desired GLOBAL SAS dataset;
proc sql;
create table global
(stock char(8), time_gap num(5), avg_ret num(5));
quit;
proc sql;
insert into global
(stock, time_gap,avg_ret)
values('',0,0);
quit;
%macro y1;
%do i = 1 %to &count2;
%let j = %scan(&name,&i,*);
proc import out = want datafile="F:\data\&j-Ret.csv"
dbms=csv replace;
getnames = yes;
run;
data want;
set want; ....
....[Here I do 5 Datasteps on the WANT sasfile]
/*Store the want file in a global SAS dataset that will contain all the stock tickers from the want file*/
data global;
set global want; run;
%end;
%mend y1;
%y1()
As you can see the global SAS dataset expands for every want dataset that I store in global.
Assuming the files have a common layout, you should not import them with PROC IMPORT or do loops. You should read them all in with one datastep. IE:
data want;
length the_file $500;
infile "f:\data\*.csv" dlm=',' lrecl=32767 dsd truncover firstobs=2 filename=the_file;
input
myvar1 myvar2 myvar3 myvar4;
stock_ticker=scan(the_file,'\',-1); *or whatever gets you the ticker name;
run;
Now, if they don't have identical layouts, or there is some complexity to the readin, you may need a more complex input statement than that, but almost always you can achieve it this way. Do loops with lots of PROC IMPORTs will always be inefficient because of the overhead of the IMPORT.
If you don't want every .csv file in the folder (and can't write a mask for what you do want), or if you have a subset of layouts, you can use the FILEVAR option to read the files in from a common dataset. You could then branch into various input statements, perhaps, if needed.
data want;
    set yfiles;
    infile a filevar=filename;
    if filename [some rule] then do;
        input ... ;
    end;
    else if ... then do;
        input ... ;
    end;
run;
