SAS: Iterate over dataset names - bash

I have a SAS program that merges two datasets containing information on a given city. The city name is part of the name of every dataset, e.g. for Atlanta:
data work.AtlantaComplete;
merge dir.Atlanta1 dir.Atlanta2;
by mergevar;
run;
I would like to run the merge on a long list of cities without having to make a separate .sas file for each one. With limited knowledge of SAS I tried the following:
%let city1 = Boston;
...
%let cityN = Miami;
%DO for i = 1 to N;
data work.city&i;
set dir.city&i.1 dir.city&i.2;
by mergevar;
run;
This produced several errors, the main one being that %DO statements must be inside a data step. This won't work for my task because the loop has to be defined before the first data step.
A solution that can be implemented within the sas program would be optimal, but I am also open to something like a Unix Bash shell script that provides each city one at a time as a system input to the sas program (similar to sys.argv in Python).
Thanks for your help

You have several small mistakes in your program.
Executing a %do loop normally has to be done inside a macro. Also, you don't use the keyword "for", and you need a % in front of the "to".
So try this:
%let city1 = Boston;
...
%let cityN = Miami;
%let N = 42; *or whatever your N is...;
%macro mergecities(citynumber);
%DO i = 1 %to &citynumber;
data work.&&city&i..Complete;
merge dir.&&city&i..1 dir.&&city&i..2;
by mergevar;
run;
%end;
%mend;
%mergecities(&N);
Instead of using the macro variable citynumber you can use &N directly inside the do loop, but with a parameter the macro is more flexible...
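For the Bash route mentioned in the question, SAS can also pick up a value passed on the command line via the -sysparm option; inside the program it is available as the automatic macro variable &SYSPARM. A minimal sketch along those lines (the program name, library path, and loop values are placeholders):
/* merge_city.sas -- expects one city name in &SYSPARM   */
/* called from Bash, for example:                        */
/*   for city in Boston Miami; do                        */
/*     sas -sysparm "$city" merge_city.sas               */
/*   done                                                */
libname dir 'your-library-path';   /* placeholder path */
data work.&sysparm.Complete;
merge dir.&sysparm.1 dir.&sysparm.2;
by mergevar;
run;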

If you have numbered macro variables, you use &&varname&i to resolve them. Also, by having your cities in a dataset, you can create the macro variables off the back of it, rather than hard-coding them all (plus the count).
data cities ;
input City $20. ;
/* Create numbered macro variables based on _n_ */
call symputx(cats('CITY',_n_),City) ;
call symputx('N',_n_) ;
datalines ;
Atlanta
Boston
Chicago
Houston
Texas
;
run ;
%MACRO LOOP ;
%DO I = 1 %TO &N ;
data &&CITY&I..Complete ;
merge dir.&&CITY&I..1
dir.&&CITY&I..2 ;
by mergevar ;
run ;
%END ;
%MEND ;
%LOOP ;

Related

Proc compare loop through multiple datasets in a folder SAS

I have two folders containing numerous datasets. The folders contain datasets with the same names, and I'd like to compare them to make sure they match. Is it possible to loop through each folder and compare each dataset?
%macro compare(dpath=, cpath=,);
%do i = 1 %to n;
proc compare base = &dpath data = &cpath;
run;
%mend;
%compare(dpath=folder1_path, cpath=folder2_path);
Point librefs to the "folders". Get the lists of datasets. Use the list to drive the code generation.
%macro compare(dpath,cpath);
libname left "&dpath";
libname right "&cpath";
proc contents data=left._all_ noprint out=contents;
run;
data _null_;
set contents;
by memname;
if first.memname;
call execute(catx(' '
,'proc compare base=',cats('left.',memname)
,'compare=',cats('right.',memname)
,';run;'
));
run;
%mend;
%compare
(dpath=folder1_path
,cpath=folder2_path
);
To make it more robust you might want to do things like check that the member names in LEFT actually match the member names in RIGHT. Or add an ID statement to the generated PROC COMPARE code so that PROC COMPARE knows how to match observations, otherwise it will just match the observations in the order they appear.
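For example, assuming the paired datasets share a common key (keyvar below is a hypothetical name), the ID statement can be added by extending the generated string:
call execute(catx(' '
,'proc compare base=',cats('left.',memname)
,'compare=',cats('right.',memname)
,';id keyvar;run;'
));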
This macro will compare the contents of all data exactly as-is and output if there are any differences at all. If there are no differences, the dataset all_differences will not be created.
%macro compare(dpath=, cpath=);
libname d "&dpath";
libname c "&cpath";
/* Save all datasets to macro variables:
&dataset1 &dataset2 etc. */
data _null_;
set sashelp.vmember;
where libname = 'D';
call symputx(cats('name', _N_), memname);
call symputx('n', _N_);
run;
proc datasets lib=work nolist;
delete all_differences;
quit;
%do i = 1 %to &n;
/* Compare dataset names and only output if they are unequal */
proc compare base = d.&&name&i
compare = c.&&name&i
out = outcomp
outnoequal;
run;
/* Get the number of obs from outcomp */
%let dsid = %sysfunc(open(outcomp));
%let nobs = %sysfunc(attrn(&dsid, nlobs));
%let rc = %sysfunc(close(&dsid));
/* If outcomp is populated, log the dataset with differences */
%if(&nobs > 0) %then %do;
data _difference_;
length dsn $32.;
dsn = "&&name&i.";
run;
proc append base=all_differences
data=_difference_
force;
run;
%end;
%end;
%mend;
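It is invoked the same way as the macro above (the paths are placeholders):
%compare(dpath=folder1_path, cpath=folder2_path);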

How to rank multiple variables in a large data set?

I have a data set of around 50 million records with around 30 variables (columns).
I need to rank the dataset for each variable.
Proc rank does not work since it requires too much memory for a dataset this large.
To assign the ranks manually, I have to sort the dataset on the respective variable column and then assign the rank with a formula. But the problem is that we would have to sort the dataset 30 times, once per variable, which would take a very long time and is not feasible.
What alternates can we use in this case?
You're in a tough spot without many options. If you're sorting and keeping all 30 variables each time, that will significantly increase your processing times. If I were you, I'd keep only the variable you want to rank plus a sequence number, apply your formula, and then merge everything back together at the end. This requires looping over each variable in your dataset and then merging it all back together. See the example below and check whether it helps decrease your processing times:
** PUT ALL VARIABLES INTO LIST **;
PROC SQL NOPRINT;
SELECT DISTINCT(NAME)
INTO :VARS SEPARATED BY " "
FROM DICTIONARY.COLUMNS
WHERE LIBNAME = 'SASHELP' AND MEMNAME = 'FISH';
QUIT;
%PUT &VARS.;
** CREATE SEQUENCE NUMBER IN FULL DATA **;
DATA FISH; SET SASHELP.FISH;
SEQ=_N_;
RUN;
** LOOP OVER EACH VARIABLE TO ONLY PROCESS THAT VARIABLE AND SEQUENCE -- REDUCES PROCESSING TIME **;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET VAR = %SCAN(&VARS,&I.);
DATA FISH_&I.; SET FISH (KEEP=SEQ &VAR.);
RUN;
/* INSERT YOUR FORMULA CODE HERE ON FISH_&I. DATA (MINE IS AN EXAMPLE) */
PROC SORT DATA = FISH_&I.;
BY &VAR.;
RUN;
DATA FISH1_&I.; SET FISH_&I.;
BY &VAR.;
RANK_&VAR = _N_;
RUN;
/* RESORT FINAL DATA BY SEQUENCE NUMBER VARIABLE */
PROC SORT DATA = FISH1_&I.;
BY SEQ;
RUN;
%END;
%MEND;
%LOOP_OVER(&VARS.);
** MERGE ALL SUBSETS BACK TOGETHER BY THE ORIGINAL SEQUENCE NUMBER **;
DATA FINAL;
MERGE FISH1_:;
BY SEQ;
DROP SEQ;
RUN;
If you just need to rank into deciles / percentiles etc rather than a complete ranking from 1 to 50m across all 50m rows, you should be able to get a very good approximation of the correct answer using a much smaller amount of memory via proc summary, using qmethod=P2 and specifying a suitable qmarkers setting.
This approach uses the P-squared algorithm:
http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
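A minimal sketch of that approach, assuming the 30 numeric variables are named var1-var30 and the table is called have (both names are placeholders):
/* Approximate percentiles with the P-squared estimation method, no sorting */
proc summary data=have qmethod=p2 qmarkers=105;
var var1-var30;
output out=pctls p1= p10= p25= median= p75= p90= p99= / autoname;
run;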
I am not sure whether it is a good idea, but you may want to use a hash object. The object is loaded into your RAM. Assuming that you have 50 million numeric observations, you will need around (2*8 bytes)*50 million = 800 MB of RAM -- if I am not mistaken.
The code could look like this (using Foxers LOOP_OVER macro from the answer above to loop over the variables, a little helper macro to get the list of variables from a dataset, and a small test dataset with two variables):
%Macro GetVars(Dset) ;
%Local VarList ;
/* open dataset */
%Let FID = %SysFunc(Open(&Dset)) ;
/* If accessible, process contents of dataset */
%If &FID %Then %Do ;
%Do I=1 %To %SysFunc(ATTRN(&FID,NVARS)) ;
%Let VarList= &VarList %SysFunc(VarName(&FID,&I));
%End ;
/* close dataset when complete */
%Let FID = %SysFunc(Close(&FID)) ;
%End ;
&VarList
%Mend ;
data dsn;
input var1 var2;
datalines;
1 48
1 8
2 5
2 965
3 105
4 105
3 85
;
run;
%MACRO LOOP_OVER(VARS);
%DO I=1 %TO %SYSFUNC(COUNTW(&VARS.));
%LET var = %SCAN(&VARS,&I.);
data out&i.(keep=rank&i.);
if 0 then set dsn;
if _N_ =1 then
do;
dcl hash hh(ordered:'A');
dcl hiter hi('hh');
hh.definekey("&var.");
hh.definedata("&var.","rank&i.");
hh.definedone();
end;
/*Get unique combination variable and point in dataset*/
do while(not last);
set dsn end=last;
hh.ref();
end;
/*Assign ranks within hash object*/
rc=hi.first();
k = 1;
do while(rc=0);
rank&i.=k;
hh.replace();
k+1;
rc=hi.next();
end;
/*Output rank to new dataset in original order of observations*/
do while(not theend);
set dsn end=theend;
hh.find();
output;
end;
/*If data can be sorted according to the rank (with no duplicates) use:
hh.output("out&i.");
out&i. will then have variables &var. and rank&i.
However, the merging below may not be sensible anymore
as correspondence between variables is not preserved.
There will also be no duplicates in the dataset.
*/
run;
%END;
%MEND LOOP_OVER;
%LOOP_OVER(%GetVars(dsn));
/*Merge all rank datasets to one large*/
data all;
merge out:;
run;

SAS - Proc Means and Granularity of Search

I am new to SAS and would like my searches to be a bit more granular.
In this example, I would like my output to show statistics for the variable SalePrice, but only for observations with CentralAir = 'Y' (another variable). I would also like to show statistics for a second variable: in this case I want to view OverallQual only where it is over 7 and the observation has '1FAM' as its BldgType (which is another variable).
I understand my syntax is incorrect. Any guidance would be appreciated. Thank you!
proc means data=MYDATA.AMES_HOUSING_DATA n nmiss p1 p10 q1 mean q3 p90 stddev median;
var SalePrice if (CentralAir = 'Y');
var OverallQual if (OverallQual GT 7 AND BldgType = '1FAM');
run;
Use a WHERE statement (or the WHERE= dataset option) to limit the records that the proc uses. You can only use one WHERE clause per procedure though, so you would need to run it twice to select the two different sets of records. You also might want to use PROC UNIVARIATE to get a summary of your variables' distribution.
proc univariate data=MYDATA.AMES_HOUSING_DATA ;
where CentralAir = 'Y';
var SalePrice;
run;
proc univariate data=MYDATA.AMES_HOUSING_DATA ;
where OverallQual GT 7 AND BldgType = '1FAM' ;
var OverallQual ;
run;
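The WHERE= dataset option mentioned above does the same subsetting inline, for example:
proc univariate data=MYDATA.AMES_HOUSING_DATA (where=(CentralAir = 'Y'));
var SalePrice;
run;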

SAS: Improving the speed of a do loop with proc import

I have over 3400 CSV files, with sizes varying between 10 KB and 3 MB. Each CSV file has the generic filename stockticker-Ret.csv, where stockticker is a stock ticker like AAPL, GOOG, YHOO, and so on, and contains stock returns for every minute of a given day. My SAS code first loads all the stock ticker names from the CSV file names into a SAS dataset. I then loop over each ticker to load the corresponding .csv file into a SAS dataset called want, apply some data steps to want, and append the result for each ticker to a SAS dataset called global. As you can imagine, this process takes a long time. Is there a way to improve my DO LOOP code below to make this process go faster?
/*Record in a sas dataset all the csv file name to extract the stock ticker*/
data yfiles;
keep filename;
length fref $8 filename $80;
rc = filename(fref, 'F:\data\');
if rc = 0 then do;
did = dopen(fref);
rc = filename(fref);
end;
else do;
length msg $200.;
msg = sysmsg();
put msg=;
did = .;
end;
if did <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
dnum = dnum(did);
do i = 1 to dnum;
filename = dread(did, i);
/* If this entry is a file, then output. */
fid = mopen(did, filename);
if fid > 0 then output;
end;
rc = dclose(did);
run;
/*store in yfiles all the stock tickers*/
data yfiles(drop=filename rename=(filename1=stock));
set yfiles;
filename1=tranwrd(filename,'-Ret.csv','');
run;
proc sql noprint;
select stock into :name separated by '*' from work.yfiles;
%let count2 = &sqlobs;
quit;
*Create the template of the desired GLOBAL SAS dataset;
proc sql;
create table global
(stock char(8), time_gap num(5), avg_ret num(5));
quit;
proc sql;
insert into global
(stock, time_gap,avg_ret)
values('',0,0);
quit;
%macro y1;
%do i = 1 %to &count2;
%let j = %scan(&name,&i,*);
proc import out = want datafile="F:\data\&j-Ret.csv"
dbms=csv replace;
getnames = yes;
run;
data want;
set want; ....
....[Here I do 5 Datasteps on the WANT sasfile]
/*Store the want file in a global SAS dataset that will contain all the stock tickers from the want file*/
data global;
set global want; run;
%end;
%mend y1;
%y1()
As you can see the global SAS dataset expands for every want dataset that I store in global.
Assuming the files have a common layout, you should not import them with PROC IMPORT in a do loop. You should read them all in with one data step, i.e.:
data want;
length the_file $500;
infile "f:\data\*.csv" dlm=',' lrecl=32767 dsd truncover firstobs=2 filename=the_file;
input
myvar1 myvar2 myvar3 myvar4;
stock_ticker=scan(the_file,'\',-1); *or whatever gets you the ticker name;
run;
Now, if they don't have identical layouts, or there is some complexity to the readin, you may need a more complex input statement than that, but almost always you can achieve it this way. Do loops with lots of PROC IMPORTs will always be inefficient because of the overhead of the IMPORT.
If you don't want every .csv file in the folder (and can't write a mask for what you do want), or if you have a subset of layouts, you can use the FILEVAR option to read the files in from a common dataset. You could then branch into various input statements, perhaps, if needed.
data want;
set yfiles;
infile a filevar=filename;
if filevar [some rule] then do;
input ... ;
end;
else if ... then do;
input ... ;
end;
run;
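For reference, here is a fuller FILEVAR sketch that actually reads each file end to end, assuming YFILES keeps a FILENAME variable holding the CSV file name and that every file shares the same four-column layout (myvar1-myvar4 are placeholders):
data want;
set yfiles;                                   /* one row per CSV file       */
length fullpath $120;
fullpath = cats('F:\data\', filename);        /* build the physical path    */
infile dummy filevar=fullpath dlm=',' dsd truncover end=done;
input;                                        /* skip the header row        */
do while(not done);                           /* read the rest of this file */
input myvar1 myvar2 myvar3 myvar4;
output;
end;
run;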

Do-loop in SAS-IML

I want to use a macro do loop inside proc iml like so:
%Let Tab1=FirstTable;
%Let Tab2=SecondTable;
%Let Tab3=ThirdTable;
*&Tab1-3 have been initialised as sas datasets;
proc iml;
* This works;
use &Tab1;
read all into Mat3;
print Mat3;
* This doesn't work;
%Macro Define_mx;
%do i=1 %to 2;
use &Tab&i;
read all into Mat&i ;
%end;
%Mend Define_mx;
%Define_mx;
*The two matrixes have not been initialised;
print Mat1;
print Mat2;
quit;
In reality I will have to initialise like 50 matrixes so a do-loop is necessary.
I can't figure out why the loop can't see &Tab&i as a macro variable.
I also tried a workaround with a normal (non-macro) do-loop using substr to concatenate the variable names, but it didn't work either. What am I missing here?
Ok, so the macro should be:
%Macro Define_mx;
%do i=1 %to 2;
use &&Tab&i;
read all into Mat&i ;
%end;
%Mend Define_mx;
%Define_mx;
The second ampersand on Tab is necessary because without it the macro processor would try to resolve &Tab as a macro variable (which does not exist). So, when concatenating macro variables to build a new reference, use &&.
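You can see the two-pass resolution with a quick %put test:
%let Tab1 = FirstTable;
%let i = 1;
%put &&Tab&i;   /* pass 1: &&Tab&i -> &Tab1 ; pass 2: &Tab1 -> FirstTable */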
If you have SAS/IML 12.1 (released with 9.3m2), there is an even simpler way.
The USE statement supports dereferencing data set names, like this:
ds = "MyData";
use (ds);
Furthermore, as I show in my article on using the VALSET function, the SAS/IML language supports VALSET, which can dynamically create matrices named Mat1, Mat2, and so on.
You can combine these features to eliminate the macros entirely:
data a b c; /* create sample data sets */
x=1;y=2; output;
x=2;y=3; output;
run;
proc iml;
dsnames = {a b c}; /* names of data sets */
do i = 1 to ncol(dsnames);
use (dsnames[i]); /* open each data set */
read all into X;
close (dsnames[i]);
MatName = "Mat"+strip(char(i)); /* create Mat1, Mat2,... */
call valset(MatName, X); /* assign values from data set */
end;
show names;
quit;
