data have
Game col2 col3 col4 ..
ABC
AZA
CGG
EDD
I need to sort dataset HAVE by Game. But for the output dataset WANT, the order should always be
Game col2 col3 col4 ..
AZA
ABC
EDD
CGG
How to achieve this in SAS? Also, the required order is stored in an external file. If the required order changes, I need to adjust my code, so I want an efficient way to do this.
You could create an informat as per the below and sort on the resulting values.
PROC FORMAT;
INVALUE SEX
'AZA' = 1
'ABC' = 2
'EDD' = 3
'CGG' = 4
;
RUN;
DATA HAVE;
TEST = "CGG";
OUTPUT;
TEST = "EDD";
OUTPUT;
TEST = "AZA";
OUTPUT;
RUN;
DATA WANT;
SET HAVE;
TEST2 = INPUT(TEST,SEX.);
RUN;
PROC SORT DATA = WANT;
BY TEST2;
RUN;
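Since the required order is stored in an external file, you can also build the informat from data with the CNTLIN= option instead of hard-coding it, so a change in the file needs no code change. A minimal sketch, assuming the file has already been read into a dataset ORDER_LIST with one GAME value per row in the required order (these names are made up; a Game value missing from the list would get a missing sort key):
data cntlin;
set order_list;
retain fmtname 'GAMEORD' type 'I'; /* TYPE='I' builds a numeric informat */
start = game; /* the value to look up */
label = put(_n_, 8.); /* its position in the file becomes the sort key */
run;
proc format cntlin=cntlin;
run;
data want;
set have;
sortkey = input(game, gameord.);
run;
proc sort data=want;
by sortkey;
run;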
I have two datasets, data1 and data2, with an id column. I want to assign a random id to each id, but this random number needs to be consistent across datasets (rand_id for id=1 must be the same in both datasets). The objective is to get:
data1:
id  rand_id
1   0.4212
2   0.5124
3   0.1231
data2:
id  rand_id
1   0.4212
3   0.1231
2   0.5124
4   0.9102
Note that ids do not need to be ordered, and some ids might appear in one dataset but not in the other. I thought
DATA data1;
SET data1;
CALL STREAMINIT(id);
rand_id=RAND('uniform');
RUN;
and the same for data2 would do the job, but it does not. It just takes the first id as the seed and generates a sequence of random numbers.
From the STREAMINIT documentation, it seems it is only honored once per data step. I'd like it to be called on every row. Is this possible?
The idea is to create a table random_values with an associated random id for each id that we later join on the two tables.
*assign random seed;
%let random_seed = 71514218;
*list of unique id;
proc sql;
create table unique_id as
select distinct id
from (
select id from have1
union all
select id from have2
)
;
quit;
*add random values;
data random_values;
set unique_id;
call streaminit(&random_seed.);
rand = rand('uniform', 0, 1);
run;
*join back on have1;
proc sql;
create table have1 as
select t1.id, t2.rand as rand_id
from have1 t1 left join random_values t2
on t1.id = t2.id
;
quit;
*join back on have2;
proc sql;
create table have2 as
select t1.id, t2.rand as rand_id
from have2 t1 left join random_values t2
on t1.id = t2.id
;
quit;
Why not use a lookup dataset? You could create/update it using a hash object.
First make an empty dataset:
data rand_id;
set one(keep=id);
rand_id=.;
stop;
run;
Then process the first dataset, adding the new RAND_ID variable to that dataset and also populating the RAND_ID dataset with all of the unique ID values.
data one_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set one end=eof;
if h.find() then do; /* FIND returns nonzero when the id is not yet in the hash */
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;
Repeat for any other datasets that share the same ID variable.
data two_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set two end=eof;
if h.find() then do;
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;
Simplest way to do this in my opinion is to create a format dataset. Tom's hash example is fine also, but this is probably easier if you don't know hash tables.
Do NOT seed the random number from the ID itself - this is not random anymore.
data forfmt;
set data1;
call streaminit(7);
label = put(rand('Uniform'),12.9);
start = id;
fmtname = 'RANDIDF';
output;
if _n_ eq 1 then do;
hlo='o';
label='.';
output;
end;
run;
proc format cntlin=forfmt;
quit;
Then you can use put(id,randidf.) to assign the random ID. If you want it to be numeric, use input instead of put and make it an informat; that's handled via type='i'; in the cntlin dataset, and the input to INPUT needs to be character (or turned into character via put). No sorting required, and the lookup is very fast most of the time.
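For the numeric variant, a minimal sketch (assuming, as above, that data1 has unique values of a numeric id; the names forinfmt, RANDIDI and data1_rand are made up):
data forinfmt;
set data1;
call streaminit(7);
fmtname = 'RANDIDI';
type = 'i'; /* 'i' makes PROC FORMAT build a numeric informat */
start = strip(put(id, best12.)); /* informat lookup keys are character */
label = put(rand('Uniform'), 12.9); /* the numeric value, stored as text */
output;
run;
proc format cntlin=forinfmt;
quit;
data data1_rand;
set data1;
rand_id = input(strip(put(id, best12.)), randidi.);
run;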
Solved:
DATA data1;
SET data1;
seed = id;
CALL RANUNI(seed,rand_id);
DROP seed;
RUN;
Generates the desired result.
I am stuck with the following problem in SAS. I have a dataset of this format:
The dataset consists of 500 ids with a different number of observations per ID. I'm trying to randomly select 5 ids and at the same time retain all of their observations. I built a random generator in the first place, saving a vector with 10 numbers in the interval [1,500]. However, it became clumsy when I tried to use this vector to select the ids corresponding to the random numbers. To be more clear, I want my net result to be a dataset which includes all observations corresponding to IDs 1, 10, 43, 22, 67, or any other sequence of 5 numbers.
Any tip will be more than appreciated!
From your question, I assume you already have your 10 random numbers. If they are saved in a table/dataset, you can run a left join between them and your original dataset, by id. This will pull out all the original observations with the same id.
Let's say that your randomly selected numbers are saved in a table called "random_ids". Then, you can do:
proc sql;
create table want as
select distinct
t1.id,
t2.*
from random_ids as t1
left join have as t2 on t1.id = t2.id;
quit;
If your random numbers are not saved in a dataset, you may simply copy them to a where statement, like:
proc sql;
create table want as
select distinct
*
from have
where id in (1, 10, 43, 22, 67); /* here you put the ids you want */
quit;
Best,
Proc SURVEYSELECT is your friend.
data have;
call streaminit(123);
do _n_ = 1 to 500;
id = rand('integer', 1e6);
do seq = 1 to rand('integer', 35);
output;
end;
end;
run;
proc surveyselect noprint data=have sampsize=5 out=want;
cluster id;
run;
proc sql noprint;
select count(distinct id) into :id_count trimmed from want;
%put NOTE: &=id_count;
If you don't have the procedure as part of your SAS license, you can do sample selection per the k/n algorithm. NOTE: the earliest archived post for k/n is a May 1996 SAS-L message, which has code based on a 1995 SAS Observations magazine article.
proc sql noprint;
select count(distinct id) into :N trimmed from have;
proc sort data=have;
by id;
data want_kn;
retain N &N k 5; /* N = ids left to scan, k = ids still to select */
if _n_ = 1 then call streaminit(123);
keep = rand('uniform') < k / N; /* pick this id with probability k/N */
if keep then k = k - 1;
do until (last.id); /* read every observation of the current id */
set have;
by id;
if keep then output;
end;
if k = 0 then stop; /* sample complete */
N = N - 1;
drop k N keep;
run;
proc sql noprint;
select count(distinct id) into :id_count trimmed from want_kn;
%put NOTE: &=id_count;
I am receiving information from a csv file from one department to compare with the same information in a different department to check for discrepancies (about 3/4 of a million rows of data with 44 columns in each row). After I have the data in a table, I have a program that will take the data and send reports based on an HQ. I feel like the way I am going about this is not the most efficient. I am using Oracle for this comparison.
Here is what I have:
I have a vb.net program that parses the data and inserts it into an extract table
I run a procedure to do a full outer join on the two tables into a new table, with the fields from one department suffixed with '_c'
I run another procedure to compare the old/new data and update 2 different tables with detail and summary information. Here is code from inside the procedure:
DECLARE
CURSOR Cur_Comp IS SELECT * FROM T.AEC_CIS_COMP;
BEGIN
FOR compRow in Cur_Comp LOOP
--If service pipe exists in CIS but not in FM and the service pipe has status of retired in CIS, ignore the variance
If(compRow.pipe_num = '' AND cis_status_c = 'R')
continue
END IF
--If there is not a summary record for this HQ in the table for this run, create one
INSERT INTO t.AEC_CIS_SUM (HQ, RUN_DATE)
SELECT compRow.HQ, to_date(sysdate, 'DD/MM/YYYY') from dual WHERE NOT EXISTS
(SELECT null FROM t.AEC_CIS_SUM WHERE HQ = compRow.HQ AND RUN_DATE = to_date(sysdate, 'DD/MM/YYYY'))
-- Check fields and update the tables accordingly
If (compRow.cis_loop <> compRow.cis_loop_c) Then
--Insert information into the details table
INSERT INTO T.AEC_CIS_DET( Fac_id, Pipe_Num, Hq, Address, AutoUpdatedFl,
DateTime, Changed_Field, CIS_Value, FM_Value)
VALUES(compRow.Fac_ID, compRow.Pipe_Num, compRow.Hq, compRow.Street_Num || ' ' || compRow.Street_Name,
'Y', sysdate, 'Cis_Loop', compRow.cis_loop, compRow.cis_loop_c);
-- Update information into the summary table
UPDATE AEC_CIS_SUM
SET cis_loop = cis_loop + 1
WHERE Hq = compRow.Hq
AND Run_Date = to_date(sysdate, 'DD/MM/YYYY')
End If;
END LOOP;
END;
Any suggestions of an easier way of doing this rather than an if statement for all 44 columns of the table? (This is run once a week if it matters)
Update: Just to clarify, there are 88 columns of data (44 pairs of duplicates to compare, one of each pair suffixed with _c). One table lists each differing field as a row, so one comparison row can mean 30+ records written in that table. The other table keeps a tally of the number of discrepancies for each week.
First of all, I believe that your task can (and actually should) be implemented with straight SQL: no fancy cursors, no loops, just selects, inserts and updates. I would start with unpivoting your source data (it is not clear whether you have a primary key to join the two sets on; I guess you do):
Col0_PK Col1 Col2 Col3 Col4
----------------------------------------
Row1_val A B C D
Row2_val E F G H
Above is your source data. Using the UNPIVOT clause we convert it to:
Col0_PK Col_Name Col_Value
------------------------------
Row1_val Col1 A
Row1_val Col2 B
Row1_val Col3 C
Row1_val Col4 D
Row2_val Col1 E
Row2_val Col2 F
Row2_val Col3 G
Row2_val Col4 H
I think you get the idea. Say we have table1 with one set of data and an identically structured table2 with the second set of data. It is a good idea to use index-organized tables here.
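For reference, a minimal sketch of that UNPIVOT step (Oracle 11g or later; names as in the example above, and the unpivoted columns must share one data type):
select Col0_PK, Col_Name, Col_Value
from table1
unpivot include nulls
(Col_Value for Col_Name in (Col1, Col2, Col3, Col4));
INCLUDE NULLS keeps rows for NULL cells, which matters because the difference query below compares NVL'd values.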
The next step is comparing rows to each other and storing the difference details. Something like:
insert into diff_details(some_service_info_columns_here)
select some_service_info_columns_here_along_with_data_difference
from table1 t1 inner join table2 t2
on t1.Col0_PK = t2.Col0_PK
and t1.Col_name = t2.Col_name
and nvl(t1.Col_value, 'Dummy1') <> nvl(t2.Col_value, 'Dummy2');
And in the last step we fill the difference summary table:
insert into diff_summary(summary_columns_here)
select diff_row_id, count(*) as diff_count
from diff_details
group by diff_row_id;
It's just a rough draft to show my approach; I'm sure there are many more details that should be taken into account. To summarize, I suggest two things:
UNPIVOT data
Use SQL statements instead of cursors
You have several issues in your code:
If(compRow.pipe_num = '' AND cis_status_c = 'R')
continue
END IF
"cis_status_c" is not declared. Is it a variable or a column in AEC_CIS_COMP?
In case it is a column, just put the condition into the cursor, i.e. SELECT * FROM T.AEC_CIS_COMP WHERE NOT (pipe_num = '' AND cis_status_c = 'R'). Note that in Oracle the empty string is NULL, so pipe_num = '' is never true; you probably want pipe_num IS NULL.
to_date(sysdate, 'DD/MM/YYYY')
That's nonsense: it implicitly converts the date to a string and back to a date. Simply use TRUNC(SYSDATE).
Anyway, I think you can use three single statements instead of a cursor:
INSERT INTO t.AEC_CIS_SUM (HQ, RUN_DATE)
SELECT comp.HQ, trunc(sysdate)
from AEC_CIS_COMP comp
WHERE NOT EXISTS
(SELECT null FROM t.AEC_CIS_SUM WHERE HQ = comp.HQ AND RUN_DATE = trunc(sysdate));
INSERT INTO T.AEC_CIS_DET( Fac_id, Pipe_Num, Hq, Address, AutoUpdatedFl, DateTime, Changed_Field, CIS_Value, FM_Value)
select comp.Fac_ID, comp.Pipe_Num, comp.Hq, comp.Street_Num || ' ' || comp.Street_Name, 'Y', sysdate, 'Cis_Loop', comp.cis_loop, comp.cis_loop_c
from T.AEC_CIS_COMP comp
where comp.cis_loop <> comp.cis_loop_c;
UPDATE AEC_CIS_SUM
SET cis_loop = cis_loop + 1
WHERE Hq IN (Select Hq from T.AEC_CIS_COMP)
AND trunc(Run_Date) = trunc(sysdate);
They are not tested, but they should give you a hint of how to do it.
This is my file:
Col1, Col2, Col3, Col4, Col5
I need only Col2 and Col3.
Currently I'm doing this:
a = load 'input' as (Col1:chararray,
Col2:chararray,
Col3:chararray,
Col4:chararray);
b = foreach a generate Col2, Col3;
Is there a way to directly load only Col2 and Col3 instead of loading the whole input and then generating the required columns?
Your method of only GENERATEing the columns you want is an effective way to do just what you ask. Remember that all of your data is stored on HDFS, and you're not loading it all into memory when you start your script. You still will have to read those bytes off the disk even if you are not keeping them around for use in your processing, so there is no performance advantage to never loading that data. The advantage comes in never having to send it to a reducer, which you have accomplished with your method.
In cases where Pig can tell that a column won't be used, it will "prune" it immediately, essentially doing for you what you did with your b = foreach a generate Col2, Col3;. This won't happen, however, if you are using a UDF that might access other fields, because Pig doesn't look inside the UDF to see if they get used. For example, suppose Col3 is an int. If you have
b = group a by Col2;
c = foreach b generate group, SUM(a.Col3);
then Pig will automatically prune the 1st and 4th columns for you, since it can see they're never used. However, if you instead did
b = group a by Col2;
c = foreach b generate group, COUNT(a);
then Pig can't prune, because it doesn't see inside the COUNT UDF and doesn't know that the other fields won't be used. When in doubt of whether Pig will do this pruning, you can use the foreach/generate method you already have. And Pig should print a diagnostic message when you start your script listing all the columns it was able to prune out.
If instead your problem is that you don't want to have to provide a full schema when you're interested in just a few columns, you can skip the schema entirely and put it in the GENERATE:
a = load 'input';
b = foreach a generate (chararray) $1 as Col2, (chararray) $2 as Col3;
In my current project we have had several occasions where we had to implement matching based on varying conditions. First, a more detailed description of the problem.
We have a table test:
key Value
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10
Now we want to apply a rule so that, inside a group (defined by the value of key), pairs with a sum of 0 are eliminated.
The expected result would be:
key value
1 10
1 20
Sort order is not relevant.
The following code is an example of our solution.
We want to eliminate observations with my_id 2 and 7, and additionally 2 of the 3 observations with amount 10.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
/* get all possible matches represented by pairs of my_id */
proc sql noprint;
create table zwischen_erg as
select a.my_id as a_id,
b.my_id as b_id
from test as a inner join
test as b on (a.alias=b.alias)
where a.amount=-b.amount;
quit;
/* select ids of matches to eliminate */
proc sort data=zwischen_erg ;
by a_id b_id;
run;
data zwischen_erg1;
set zwischen_erg;
by a_id;
if first.a_id then tmp_id1 = 0;
tmp_id1 +1;
run;
proc sort data=zwischen_erg;
by b_id a_id;
run;
data zwischen_erg2;
set zwischen_erg;
by b_id;
if first.b_id then tmp_id2 = 0;
tmp_id2 +1;
run;
proc sql;
create table delete_ids as
select erg1.a_id as my_id
from zwischen_erg1 as erg1 left join
zwischen_erg2 as erg2 on
(erg1.a_id = erg2.a_id and
erg1.b_id = erg2.b_id)
where tmp_id1 = tmp_id2
;
quit;
/* use delete_ids as filter */
proc sql noprint;
create table erg as
select a.*
from test as a left join
delete_ids as b on (a.my_id = b.my_id)
where b.my_id=.;
quit;
The algorithm seems to work; at least nobody has found input data that caused an error.
But nobody could explain to me why it works, and I don't understand in detail how it is working.
So I have a couple of questions.
Does this algorithm eliminate the pairs in a correct manner for all possible combinations of input data?
If it does work correctly, how does the algorithm work in detail? Especially the part
where tmp_id1 = tmp_id2.
Is there a better algorithm to eliminate corresponding pairs?
Thanks in advance and happy coding
Michael
As an answer to your third question: the following approach seems simpler to me, and probably more performant (since it has no joins).
/*For every (absolute) value, find how many more positive/negative occurrences we have per key*/
proc sql;
create view V_INTERMEDIATE_VIEW as
select key, abs(Value) as Value_abs, sum(sign(value)) as balance
from INPUT_DATA
group by key, Value_abs
;
quit;
*The balance variable here means how many times more often did we see the positive than the negative of this value. I.e., how many of either the positive or the negative were we not able to eliminate;
/*Now output*/
data OUTPUT_DATA (keep=key Value);
set V_INTERMEDIATE_VIEW;
Value = sign(balance)*Value_abs; *Put the correct value back;
do i=1 to abs(balance) by 1;
output;
end;
run;
If you only want pure SAS (so no proc sql), you could do it as below. Note that the idea behind it remains the same.
data V_INTERMEDIATE_VIEW /view=V_INTERMEDIATE_VIEW;
set INPUT_DATA;
value_abs = abs(value);
run;
proc sort data=V_INTERMEDIATE_VIEW out=INTERMEDIATE_DATA;
by key value_abs; *we will encounter the negatives of each value and then the positives;
run;
data OUTPUT_DATA (keep=key value);
set INTERMEDIATE_DATA;
by key value_abs;
retain balance 0;
balance = sum(balance,sign(value));
if last.value_abs then do;
value = sign(balance)*value_abs; *set sign depending on what we have in excess;
do i=1 to abs(balance) by 1;
output;
end;
balance=0; *reset balance for next value_abs;
end;
run;
NOTE: thanks to Joe for some useful performance suggestions.
I don't see any bugs after a quick read. But "zwischen_erg" could have a lot of unnecessary many-to-many matches which would be inefficient.
This seems to work (but not guaranteed), and might be more efficient. Also shorter, so perhaps easier to see what's going on.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
proc sort data=test;
by alias amount;
run;
data zwischen_erg;
set test;
by alias amount;
if first.amount then occurrence = 0;
occurrence+1;
run;
proc sql;
create table zwischen as
select
a.my_id,
a.alias,
a.amount
from zwischen_erg as a
left join zwischen_erg as b
on a.alias = b.alias and a.amount = (-1)*b.amount and a.occurrence = b.occurrence
where b.my_id is missing; /* keep rows with no offsetting partner */
quit;