Concatenating SAS datasets but preserving order of one dataset

I'm looking to add to a macro used as part of a standard process. The macro currently brings together multiple datasets from different product groups as below:
%macro test(group=);
data X;
set
%if &group = AAA %then %do;
LIB.AAA1
LIB.AAA2
LIB.AAA3
%end;
%else %if &group = BBB %then %do;
LIB.BBB1
LIB.BBB2
LIB.BBB3
%end;
%else %if &group = CCC %then %do;
LIB.CCC1
LIB.CCC2
LIB.CCC3
%end;
;
by customer key var1 var2;
if first.customer then do;
<logic>
end;
run;
%mend;
What I'm trying to achieve is inserting my own dataset and preserving its order to meet a new requirement. I also need to do this while changing as little of the standard macro above as possible so I don't affect the rest of the data and downstream processes.
In a separate program that runs before this macro, I have sorted my dataset with an extra variable, type, between customer and key. If I sort by just the variables above, my dataset is in the wrong order, which gives the wrong first.customer results from the test macro. The type variable does not exist in any of the other datasets. I could use an existing variable instead, but unless I can isolate that too, it would affect the order of the other datasets, which I don't want to touch.
The code I have so far:
%macro test(group=);
data X;
set
%if &group = AAA %then %do;
LIB.AAA1
LIB.AAA2
LIB.AAA3
%end;
%else %if &group = BBB %then %do;
LIB.MYDATA
LIB.BBB1
LIB.BBB2
LIB.BBB3
%end;
%else %if &group = CCC %then %do;
LIB.CCC1
LIB.CCC2
LIB.CCC3
%end;
;
%if &group = BBB %then
%let byvarlist = customer descending type key var1 var2;
%else
%let byvarlist = customer key var1 var2;
by &byvarlist.;
if first.customer then do;
<logic>
end;
run;
%mend;
The BY statement does include the new variable I want, but now of course I am getting the following SAS error for each dataset in group BBB:
ERROR: BY variable type is not on input data set LIB.BBB1.
Adding a length statement hasn't made a difference to the error, neither does the order of the list of BBB datasets (i.e. having MYDATA above BBB1 etc.). The other idea I had was to specify my dataset only in the %if ... %then logic, like %if &dataset. = LIB.MYDATA %then ..., but I'm not sure how to go about this and whether it would still work.
Is there any way of getting around this issue, so my dataset can be sorted further without changing the sort of the other datasets?

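I haven't tested this code, but you can try concatenating the input datasets into a temporary dataset first, sorting that dataset by the extended BY list, and then running the BY-group logic on the sorted result.
%macro test(group=);
/* Step 1: concatenate without a BY statement, so type need not exist in every input */
data X_TEMP;
set
%if &group = AAA %then %do;
LIB.AAA1
LIB.AAA2
LIB.AAA3
%end;
%else %if &group = BBB %then %do;
LIB.MYDATA
LIB.BBB1
LIB.BBB2
LIB.BBB3
%end;
%else %if &group = CCC %then %do;
LIB.CCC1
LIB.CCC2
LIB.CCC3
%end;
;
run;
/* Step 2: type only exists in X_TEMP when LIB.MYDATA was read (group BBB) */
%if &group = BBB %then %let byvarlist = customer descending type key var1 var2;
%else %let byvarlist = customer key var1 var2;
proc sort data=X_TEMP;
by &byvarlist;
run;
/* Step 3: first.customer is a DATA step variable, so use a plain IF, not %IF */
data X;
set X_TEMP;
by &byvarlist;
if first.customer then do;
<logic>
end;
run;
%mend;
For group BBB, X_TEMP now has all the variables, including type (which is missing for the rows that come from the other datasets).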
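A different route, which is not part of the answer above, is to keep the interleaving SET ... BY from the question and give the other BBB datasets a type variable through DATA step views, so the stored datasets themselves are never changed. The following is only a minimal sketch under assumptions: mydata, bbb1 and bbb1_v are hypothetical stand-ins for LIB.MYDATA, LIB.BBB1 and a view on it, type is assumed to be numeric, and the BY list is shortened to three variables.

```sas
/* Hypothetical stand-in for LIB.MYDATA, already sorted by customer, descending type, key */
data mydata;
  input customer type key;
  datalines;
1 2 10
1 1 5
2 3 7
;
run;

/* Hypothetical stand-in for LIB.BBB1, sorted by customer and key, with no type variable */
data bbb1;
  input customer key;
  datalines;
1 1
2 2
;
run;

/* The view adds type (always missing) on the fly; bbb1 itself is not modified */
data bbb1_v / view=bbb1_v;
  set bbb1;
  type = .;
run;

/* Every input now has type, so the interleave no longer raises the BY variable error */
data x;
  set mydata bbb1_v;
  by customer descending type key;
run;
```

Because type is constant within the view, data that is sorted by customer and key is also sorted by customer, descending type and key. Missing values are the lowest values in SAS, so under DESCENDING the rows from the view come after the MYDATA rows within each customer.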

Related

Assign a consistent random number to id in SAS across datasets

I have two datasets data1 and data2 with an id column. I want to assign a random id to each id, but this random number needs to be consistent across datasets. (rand_id for id=1 must be the same in both datasets). The objective is to get:
data1:
id rand_id
1 0.4212
2 0.5124
3 0.1231
data2:
id rand_id
1 0.4212
3 0.1231
2 0.5124
4 0.9102
Note that the ids do not need to be ordered, and some ids might appear in one dataset but not in the other. I thought
DATA data1;
SET data1;
CALL STREAMINIT(id);
rand_id=RAND('uniform');
RUN;
and the same for data2 would do the job, but it does not. It just takes the first id as the seed and generates a sequence of random numbers from it.
From the STREAMINIT documentation, it seems it only takes effect once per data step. I'd like it to be called for every row. Is this possible?

The idea is to create a table random_values with an associated random id for each id that we later join on the two tables.
*assign random seed;
%let random_seed = 71514218;
*list of unique id;
proc sql;
create table unique_id as
select distinct id
from (
select id from have1
union all
select id from have2
)
;
quit;
*add random values;
data random_values;
set unique_id;
call streaminit(&random_seed.);
rand = rand('uniform', 0, 1);
run;
*join back on have1;
proc sql;
create table have1 as
select t1.id, t2.rand as rand_id
from have1 t1 left join random_values t2
on t1.id = t2.id
;
quit;
*join back on have2;
proc sql;
create table have2 as
select t1.id, t2.rand as rand_id
from have2 t1 left join random_values t2
on t1.id = t2.id
;
quit;

Why not use a lookup dataset? You could create and update it using a hash object.
First make an empty dataset:
data rand_id;
set one(keep=id);
rand_id=.;
stop;
run;
Then process the first dataset. This step adds the new RAND_ID variable to that dataset and also populates the RAND_ID dataset with all of the unique ID values.
data one_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set one end=eof;
if h.find() then do;
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;
Repeat for any other datasets that share the same ID variable.
data two_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set two end=eof;
if h.find() then do;
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;

Simplest way to do this in my opinion is to create a format dataset. Tom's hash example is fine also, but this is probably easier if you don't know hash tables.
Do NOT seed the random number from the ID itself - this is not random anymore.
data forfmt;
set data1;
call streaminit(7);
label = put(rand('Uniform'),12.9);
start = id;
fmtname = 'RANDIDF';
output;
if _n_ eq 1 then do;
hlo='o';
label='.';
output;
end;
run;
proc format cntlin=forfmt;
quit;
Then you can use put(id,randidf.) to assign the random ID. If you want the result to be numeric, build an informat instead (type='i' in the control dataset) and use input instead of put; an informat needs character input, so a numeric id must first be converted with put. No sorting is required, and the lookup is very fast most of the time.
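As a rough end-to-end sketch of the format approach, the example below builds the control dataset from the distinct ids of both datasets (an assumption added here so that ids found only in data2 also receive a value) and then applies the format. The sample rows, the seed 7 and the names all_ids and data1_rand / data2_rand are illustrative only.

```sas
/* Illustrative sample data matching the ids in the question */
data data1;
  input id @@;
  datalines;
1 2 3
;
run;

data data2;
  input id @@;
  datalines;
1 3 2 4
;
run;

/* UNION removes duplicates, so each id appears once in the control dataset */
proc sql;
  create table all_ids as
  select id from data1
  union
  select id from data2;
quit;

/* Control dataset for the numeric format RANDIDF: one random label per id */
data forfmt;
  set all_ids;
  length fmtname $8 label $12;
  retain fmtname 'RANDIDF';
  call streaminit(7);            /* only the first call in the step takes effect */
  start = id;
  label = put(rand('uniform'), 12.9);
run;

proc format cntlin=forfmt;
run;

/* The same id maps to the same label in both datasets */
data data1_rand;
  set data1;
  rand_id = input(put(id, randidf.), 12.);
run;

data data2_rand;
  set data2;
  rand_id = input(put(id, randidf.), 12.);
run;
```

The INPUT around PUT converts the character label back to a number; drop it if a character rand_id is acceptable.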

Solved:
DATA data1;
SET data1;
seed = id;
CALL RANUNI(seed,rand_id);
DROP seed;
RUN;
Generates the desired result.

How can I create a variable in SAS containing Ridit scores for each observation?

I have a variable which is a sum score ("fsum"), and I need to calculate weighted ridit scores for each observation and save them as a new variable, so that I can use the ridits as a continuous variable in other analyses.
I tried using the OUT= option in PROC FREQ, but of course it just saved the frequencies, not the ridit scores (see below):
proc freq data = ftest;
weight dataset_weight;
tables fsum / out = ridits scores = ridit;
run;

You can capture the ridit row scores using ODS OUTPUT. The SCOROUT option of the TABLES statement tells PROC FREQ to produce the row and column scores, and the row scores are available as the ODS table named rowscores.
This example code creates two data sets, work.freqdata and work.riditscores. The first comes from the OUT= option of the TABLES statement, the second from the ODS OUTPUT statement:
data have;
do id = 1 to 100;
do test = 1 to 6;
grade = 60 + ceil(40 * ranuni(123));
array ws (6) _temporary_ (0.15, 0.15, 0.10, 0.15, 0.10, 0.35);
W = ws(test);
output;
end;
end;
run;
ods noresults; * prevent actual ODS destination generation, but still create ODS output tables;
ods output rowscores=work.riditscores; * capture the desired ODS output table;
proc freq data=have;
weight W;
tables grade*test
/ scores=ridit
all
scorout
out=freqdata
;
run;
ods results;
I'm not sure this is the wisest layout, but you can then transpose the scores into one row, with new columns named ridit_for_<score>:
proc transpose data=riditscores out=ridit_across prefix=ridit_for_;
by table;
id grade;
var score;
run;

Efficient macro looping in SAS to get to Oracle Stored Procedure

I'm using SAS to access an Oracle database. The problem is that the function / stored procedure lives on one server in Oracle, which is fine when my data lives there too, but when the data is on a different server I still want to use that function. So I loaded some macro variables with the personal ids to pass them to the function in a loop. It works, but it's painfully slow. I don't need 'optimal', just 'reasonable'; my datasets will max out at around 100,000 rows. I've read that creating a dataset is one of the most resource-intensive jobs in SAS, so I'm experimenting with creating an empty table and inserting into it, but I haven't noticed much gain yet.
So the question is - can I use the Oracle stored procedures for data on a different server in a reasonable amount of time within SAS? (Either by improving my existing approach or something completely different)
My first attempt (around 25 minutes for 13,000 personal id's):
%MACRO STATE() ;
options nosource nonotes;
%* 2. get macro max loop n;
proc sql noprint;
select left(put(count(distinct pidm),10.)) into :loopn from examp
;quit;
%* 3. load macros with the pidms of interest;
proc sql noprint;
select distinct pidm into :pidm1 - :pidm&loopn from examp order by pidm;
quit;
%Do i = 1 %TO &loopn ; /*build em */
%* %put **************LOOP &i OF &loopn *********************;
proc sql noprint;
connect to oracle as mycon(user=xxxxxx password=xxxxxxx path='PROD') ;
create table subsetdat&i as
select * from connection to mycon
(select %quote(&&pidm&i) as pidm ,UILIB.ADDR.STATE(&&pidm&i, 'MA') as state
from dual);
disconnect from mycon ;
; quit;
%END;
data state; set subsetdat1-subsetdat&loopn ; /*stack 'em */
%Do j = 1 %TO &loopn ; /*drop 'em */
proc sql ;
drop table subsetdat&j
;
%END;
options source notes;
%MEND STATE ;
options nomprint;
%STATE() ;

Move the loop inside the PROC SQL step. That removes the overhead of creating multiple datasets from multiple pass-through queries; a union all then 'stacks' the individual query results together.
%MACRO STATE() ;
options nosource nonotes;
/* 2. get macro max loop n; */
proc sql noprint;
select left(put(count(distinct pidm),10.)) into :loopn from examp
;quit;
/* 3. load macros with the pidms of interest; */
proc sql noprint;
select distinct pidm into :pidm1 - :pidm&loopn from examp order by pidm;
quit;
/* Build single pass-thru query with multiple select ... union all select ... etc */
proc sql noprint;
connect to oracle as mycon(user=xxxxxx password=xxxxxxx path='PROD') ;
create table state as
select * from connection to mycon
(%DO I = 1 %TO &loopn ; /*build em */
select %quote(&&pidm&i) as pidm ,UILIB.ADDR.STATE(&&pidm&i, 'MA') as state from dual
%IF &I lt &LOOPN %THEN %DO ; /* if not last iteration do a `union all` */
union all
%END ;
%END ;
) ;
disconnect from mycon ;
quit;
options source notes;
%MEND STATE ;
options nomprint;
%STATE() ;

Sort column by manual order in SAS

data have
Game col2 col3 col4 ..
ABC
AZA
CGG
EDD
I need to sort dataset HAVE by Game, but in the output dataset WANT the order should always be:
Game col2 col3 col4 ..
AZA
ABC
EDD
CGG
How can I achieve this in SAS? Also, the required order is stored in an external file. If the required order changes, I need to adjust my code, so I want an efficient way to do this.

You could create an informat as per the below and sort on the resulting values.
PROC FORMAT;
INVALUE SEX
'AZA' = 1
'ABC' = 2
'EDD' = 3
'CGG' = 4
;
RUN;
DATA HAVE;
TEST = "CGG";
OUTPUT;
TEST = "EDD";
OUTPUT;
TEST = "AZA";
OUTPUT;
RUN;
DATA WANT;
SET HAVE;
TEST2 = INPUT(TEST,SEX.);
RUN;
PROC SORT DATA = WANT;
BY TEST2;
RUN;
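Since the question says the required order lives in an external file, the informat could also be generated from that file with a CNTLIN control dataset instead of being typed by hand. This is only a sketch under assumptions: the file is taken to hold one Game value per line in the required order, it is simulated here with DATALINES rather than a real INFILE path, and the names order_ctl, GAMEORD and sort_rank are made up for the example.

```sas
/* Stand-in for the external order file: one Game value per line, in the required order */
data order_ctl;
  length fmtname $32 type $1 start $8 label $8;
  retain fmtname 'GAMEORD' type 'I';   /* type 'I' requests a numeric informat */
  input start $;
  label = cats(_n_);                   /* line position becomes the sort rank */
  datalines;
AZA
ABC
EDD
CGG
;
run;

/* Build the informat from the control dataset */
proc format cntlin=order_ctl;
run;

data have;
  input Game $;
  datalines;
ABC
AZA
CGG
EDD
;
run;

/* Translate each Game value to its rank, then sort on the rank */
data want;
  set have;
  sort_rank = input(Game, gameord.);
run;

proc sort data=want;
  by sort_rank;
run;
```

With this layout, a change in the required order only means editing the file; the program itself stays the same. Game values that are missing from the file are not handled in this sketch.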

Eliminate pairs of observations under the condition, that observations can have more than one possible partner observation

In my current project we have had several occasions where we had to implement a matching based on varying conditions. First, a more detailed description of the problem.
We have a table test:
key Value
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10
Now we want to apply a rule, so that inside a group (defined by value of key) pairs with a sum of 0 should be eliminated.
The expected result would be:
key value
1 10
1 20
Sort order is not relevant.
The following code is an example of our solution.
We want to eliminate the observations with my_id 2 and 7, and additionally 2 of the 3 observations with amount 10.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
/* get all possible matches represented by pairs of my_id */
proc sql noprint;
create table zwischen_erg as
select a.my_id as a_id,
b.my_id as b_id
from test as a inner join
test as b on (a.alias=b.alias)
where a.amount=-b.amount;
quit;
/* select ids of matches to eliminate */
proc sort data=zwischen_erg ;
by a_id b_id;
run;
data zwischen_erg1;
set zwischen_erg;
by a_id;
if first.a_id then tmp_id1 = 0;
tmp_id1 +1;
run;
proc sort data=zwischen_erg;
by b_id a_id;
run;
data zwischen_erg2;
set zwischen_erg;
by b_id;
if first.b_id then tmp_id2 = 0;
tmp_id2 +1;
run;
proc sql;
create table delete_ids as
select erg1.a_id as my_id
from zwischen_erg1 as erg1 left join
zwischen_erg2 as erg2 on
(erg1.a_id = erg2.a_id and
erg1.b_id = erg2.b_id)
where tmp_id1 = tmp_id2
;
quit;
/* use delete_ids as filter */
proc sql noprint;
create table erg as
select a.*
from test as a left join
delete_ids as b on (a.my_id = b.my_id)
where b.my_id=.;
quit;
The algorithm seems to work; at least, nobody has found input data that causes an error.
But nobody could explain to me why it works, and I don't understand in detail how it works.
So I have a couple of questions:
1. Does this algorithm eliminate the pairs correctly for all possible combinations of input data?
2. If it does work correctly, how does the algorithm work in detail? Especially the part where tmp_id1 = tmp_id2.
3. Is there a better algorithm to eliminate corresponding pairs?
Thanks in advance and happy coding
Michael

As an answer to your third question: the following approach seems simpler to me,
and it probably performs better (since it has no joins).
/*For every (absolute) value, find how many more positive/negative occurrences we have per key*/
proc sql;
create view V_INTERMEDIATE_VIEW as
select key, abs(Value) as Value_abs, sum(sign(value)) as balance
from INPUT_DATA
group by key, Value_abs
;
quit;
*The balance variable here means how many times more often did we see the positive than the negative of this value. I.e., how many of either the positive or the negative were we not able to eliminate;
/*Now output*/
data OUTPUT_DATA (keep=key Value);
set V_INTERMEDIATE_VIEW;
Value = sign(balance)*Value_abs; *Put the correct value back;
do i=1 to abs(balance) by 1;
output;
end;
run;
If you only want pure SAS (so no proc sql), you could do it as below. Note that the idea behind it remains the same.
data V_INTERMEDIATE_VIEW /view=V_INTERMEDIATE_VIEW;
set INPUT_DATA;
value_abs = abs(value);
run;
proc sort data=V_INTERMEDIATE_VIEW out=INTERMEDIATE_DATA;
by key value_abs; *sorting on value_abs puts each value next to its negative;
run;
data OUTPUT_DATA (keep=key value);
set INTERMEDIATE_DATA;
by key value_abs;
retain balance 0;
balance = sum(balance,sign(value));
if last.value_abs then do;
value = sign(balance)*value_abs; *set sign depending on what we have in excess;
do i=1 to abs(balance) by 1;
output;
end;
balance=0; *reset balance for next value_abs;
end;
run;
NOTE: thanks to Joe for some useful performance suggestions.
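To see the balance idea on the question's own numbers, here is a small worked run of the PROC SQL variant. It is a sketch rather than a tested production version: it writes the intermediate result to a table named balances (a name chosen here) instead of a view, and it uses the CALCULATED keyword in the GROUP BY clause.

```sas
/* Sample rows copied from the key/Value table in the question */
data INPUT_DATA;
  input key Value;
  datalines;
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10
;
run;

/* Per key and absolute value: count of positives minus count of negatives */
proc sql;
  create table balances as
  select key,
         abs(Value) as Value_abs,
         sum(sign(Value)) as balance
  from INPUT_DATA
  group by key, calculated Value_abs;
quit;

/* Write out one row for every value that found no partner */
data OUTPUT_DATA (keep=key Value);
  set balances;
  Value = sign(balance) * Value_abs;
  do i = 1 to abs(balance);
    output;
  end;
run;

proc print data=OUTPUT_DATA noobs;
run;
```

For key 1 the balance is 1 for absolute value 10 (three positives, two negatives) and 1 for absolute value 20, while key 2 has a balance of 0. OUTPUT_DATA therefore holds the rows (1, 10) and (1, 20), which is the expected result stated in the question.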

I don't see any bugs after a quick read. But "zwischen_erg" could contain a lot of unnecessary many-to-many matches, which would be inefficient.
The following seems to work (no guarantees) and might be more efficient. It is also shorter, so it is perhaps easier to see what's going on.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
proc sort data=test;
by alias amount;
run;
data zwischen_erg;
set test;
by alias amount;
if first.amount then occurrence = 0;
occurrence+1;
run;
proc sql;
create table zwischen as
select
a.my_id,
a.alias,
a.amount
from zwischen_erg as a
left join zwischen_erg as b
on a.alias = b.alias and a.amount = (-1)*b.amount and a.occurrence = b.occurrence
where b.my_id is missing;
quit;
