Eliminate pairs of observations under the condition that observations can have more than one possible partner observation - algorithm

In my current project we have had several occasions where we had to implement a matching based on varying conditions. First, a more detailed description of the problem.
We have a table test:
key value
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10
Now we want to apply a rule so that, inside a group (defined by the value of key), pairs of observations whose values sum to 0 are eliminated. For key 1 the two (10, -10) pairs cancel, leaving one 10 and the 20; for key 2 the single pair cancels completely.
The expected result would be:
key value
1 10
1 20
Sort order is not relevant.
The following code is an example of our solution.
Here we want to eliminate the observations with my_id 2 and 7 and, additionally, 2 of the 3 observations with amount 10.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
/* get all possible matches represented by pairs of my_id */
proc sql noprint;
create table zwischen_erg as
select a.my_id as a_id,
b.my_id as b_id
from test as a inner join
test as b on (a.alias=b.alias)
where a.amount=-b.amount;
quit;
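To see what this step produces for the sample data, you can print the intermediate table (a check added here purely for illustration, it is not part of the original program). For the eight test rows it contains twelve candidate pairs: each of the ids 1, 6 and 8 is paired with 2 and 7, and each of the ids 2 and 7 is paired with 1, 6 and 8.
proc print data=zwischen_erg;
run;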
/* select ids of matches to eliminate */
proc sort data=zwischen_erg ;
by a_id b_id;
run;
data zwischen_erg1;
set zwischen_erg;
by a_id;
if first.a_id then tmp_id1 = 0;
tmp_id1 +1;
run;
proc sort data=zwischen_erg;
by b_id a_id;
run;
data zwischen_erg2;
set zwischen_erg;
by b_id;
if first.b_id then tmp_id2 = 0;
tmp_id2 +1;
run;
proc sql;
create table delete_ids as
select zwischen_erg1.a_id as my_id
from zwischen_erg1 as erg1 left join
zwischen_erg2 as erg2 on
(erg1.a_id = erg2.a_id and
erg1.b_id = erg2.b_id)
where tmp_id1 = tmp_id2
;
quit;
/* use delete_ids as filter */
proc sql noprint;
create table erg as
select a.*
from test as a left join
delete_ids as b on (a.my_id = b.my_id)
where b.my_id=.;
quit;
The algorithm seems to work; at least nobody has found input data that caused an error.
But nobody could explain to me why it works, and I don't understand in detail how it is working.
So I have a couple of questions.
Does this algorithm eliminate the pairs correctly for all possible combinations of input data?
If it does work correctly, how does the algorithm work in detail? Especially the part
where tmp_id1 = tmp_id2.
Is there a better algorithm to eliminate corresponding pairs?
Thanks in advance and happy coding
Michael

As an answer to your third question: the following approach seems simpler to me, and probably more performant, since it uses no joins.
/*For every (absolute) value, find how many more positive/negative occurrences we have per key*/
proc sql;
create view V_INTERMEDIATE_VIEW as
select key, abs(Value) as Value_abs, sum(sign(value)) as balance
from INPUT_DATA
group by key, Value_abs
;
quit;
*The balance variable here means how many times more often did we see the positive than the negative of this value. I.e., how many of either the positive or the negative were we not able to eliminate;
/*Now output*/
data OUTPUT_DATA (keep=key Value);
set V_INTERMEDIATE_VIEW;
Value = sign(balance)*Value_abs; *Put the correct value back;
do i=1 to abs(balance) by 1;
output;
end;
run;
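For testing, an INPUT_DATA dataset matching the sample table from the question can be created as below (the dataset and variable names are simply the ones used in the view above). Running the view and the data step against it should leave only key=1 with the values 10 and 20 in OUTPUT_DATA.
data INPUT_DATA;
input key Value;
datalines;
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10
;
run;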
If you only want pure SAS (so no proc sql), you could do it as below. Note that the idea behind it remains the same.
data V_INTERMEDIATE_VIEW /view=V_INTERMEDIATE_VIEW;
set INPUT_DATA;
value_abs = abs(value);
run;
proc sort data=V_INTERMEDIATE_VIEW out=INTERMEDIATE_DATA;
by key value_abs; *we will encounter the negatives of each value and then the positives;
run;
data OUTPUT_DATA (keep=key value);
set INTERMEDIATE_DATA;
by key value_abs;
retain balance 0;
balance = sum(balance,sign(value));
if last.value_abs then do;
value = sign(balance)*value_abs; *set sign depending on what we have in excess;
do i=1 to abs(balance) by 1;
output;
end;
balance=0; *reset balance for next value_abs;
end;
run;
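To verify either version against the question's sample data, a simple check (added here for illustration, not part of the original answer) is:
proc print data=OUTPUT_DATA;
run;
With the sample INPUT_DATA shown earlier this should list exactly two rows: key=1 value=10 and key=1 value=20.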
NOTE: thanks to Joe for some useful performance suggestions.

I don't see any bugs after a quick read. But zwischen_erg could contain a lot of unnecessary many-to-many matches, which would be inefficient.
The following seems to work (though that is not guaranteed) and might be more efficient. It is also shorter, so perhaps easier to see what's going on.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
proc sort data=test;
by alias amount;
run;
data zwischen_erg;
set test;
by alias amount;
if first.amount then occurrence = 0;
occurrence+1;
run;
proc sql;
create table zwischen as
select
a.my_id,
a.alias,
a.amount
from zwischen_erg as a
left join zwischen_erg as b
on a.amount = (-1)*b.amount and a.occurrence = b.occurrence
where b.my_id is missing;
quit;
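As a quick sanity check (again purely illustrative, not part of the answer), printing the result for the sample data should show my_id 3, 4, 5 and 8, i.e. the unmatched 8000, -16000 and 700 rows plus the one leftover amount-10 observation:
proc print data=zwischen;
run;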

Related

Assign a consistent random number to id in SAS across datasets

I have two datasets data1 and data2 with an id column. I want to assign a random id to each id, but this random number needs to be consistent across datasets (the rand_id for id=1 must be the same in both datasets). The objective is to get:
data1:
id  rand_id
1   0.4212
2   0.5124
3   0.1231
data2:
id  rand_id
1   0.4212
3   0.1231
2   0.5124
4   0.9102
Note that ids do not need to be ordered, and some ids might appear in one dataset but not in the other. I thought
DATA data1;
SET data1;
CALL STREAMINIT(id);
rand_id=RAND('uniform');
RUN;
and the same for data2 would do the job, but it does not. It just takes the first id as the seed and generates a sequence of random numbers.
From the STREAMINIT documentation, it seems it is only called once per data step. I'd like to call it on every row. Is this possible?
The idea is to create a table random_values with an associated random id for each id that we later join on the two tables.
*assign random seed;
%let random_seed = 71514218;
*list of unique id;
proc sql;
create table unique_id as
select distinct id
from (
select id from have1
union all
select id from have2
)
;
quit;
*add random values;
data random_values;
set unique_id;
call streaminit(&random_seed.);
rand = rand('uniform', 0, 1);
run;
*join back on have1;
proc sql;
create table have1 as
select t1.id, t2.rand as rand_id
from have1 t1 left join random_values t2
on t1.id = t2.id
;
quit;
*join back on have2;
proc sql;
create table have2 as
select t1.id, t2.rand as rand_id
from have2 t1 left join random_values t2
on t1.id = t2.id
;
quit;
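One way to confirm that the assignment really is consistent across the two tables is a check like the one below (illustrative only, not part of the original answer); it should return no rows.
proc sql;
/* ids present in both tables must carry the same rand_id */
select a.id, a.rand_id as rand_id1, b.rand_id as rand_id2
from have1 as a inner join have2 as b
on a.id = b.id
where a.rand_id ne b.rand_id;
quit;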
Why not use a lookup dataset? You could create/update it using a HASH object.
First make an empty dataset:
data rand_id;
set one(keep=id);
rand_id=.;
stop;
run;
Then process the first dataset, adding the new RAND_ID variable to it and also populating the RAND_ID dataset with all of the unique ID values.
data one_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set one end=eof;
if h.find() then do;
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;
Repeat for any other datasets that share the same ID variable.
data two_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set two end=eof;
if h.find() then do;
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;
The simplest way to do this, in my opinion, is to create a format dataset. Tom's hash example is fine also, but this is probably easier if you don't know hash tables.
Do NOT seed the random number from the ID itself; that is not random anymore.
data forfmt;
set data1;
call streaminit(7);
label = put(rand('Uniform'),12.9);
start = id;
fmtname = 'RANDIDF';
output;
if _n_ eq 1 then do;
hlo='o';
label='.';
output;
end;
run;
proc format cntlin=forfmt;
quit;
Then you can use put(id,randidf.) to assign the random ID. If you want the result to be numeric, use input instead of put and make it an informat (that is handled via type='i'; and requires the input to be character, or to be turned into character via put). No sorting is required, and the lookup is very fast most of the time.
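A minimal sketch of applying the format (assuming the data1 dataset and the RANDIDF format built above; the output dataset name is just illustrative):
data data1_with_rand;
set data1;
/* put() returns the 12.9 character label for this id, */
/* input() turns it back into a numeric random value */
rand_id = input(put(id, randidf.), 12.);
run;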
Solved:
DATA data1;
SET data1;
seed = id;
CALL RANUNI(seed,rand_id);
DROP seed;
RUN;
Generates the desired result.

Randomly select 10 subjects and retain all of their observations

I am stuck with the following problem in SAS. I have a dataset of this format:
The dataset consists of 500 ids with a different number of observations per ID. I'm trying to randomly select 5 ids and at the same time retain all of their observations. I built a random generator in the first place, saving a vector with 10 numbers in the interval [1,500]. However, it became clumsy when I tried to use this vector to select the ids corresponding to the random numbers. To be more clear, I want my net result to be a dataset which includes all observations corresponding to IDs 1, 10, 43, 22, 67, or any other sequence of 5 numbers.
Any tip will be more than appreciated!
From your question, I assume you already have your 10 random numbers. If they are saved in a table/dataset, you can run a left join between them and your original dataset, by id. This will pull out all the original observations with the same id.
Let's say that your randomly selected numbers are saved in a table called "random_ids". Then, you can do:
proc sql;
create table want as
select distinct
t1.id,
t2.*
from random_ids as t1
left join have as t2 on t1.id = t2.id;
quit;
If your random numbers are not saved in a dataset, you may simply copy them to a where statement, like:
proc sql;
create table want as
select distinct
*
from have
where id in (1 10 43 22 67); /*here you put the ids you want*/
quit;
Best,
Proc SURVEYSELECT is your friend.
data have;
call streaminit(123);
do _n_ = 1 to 500;
id = rand('integer', 1e6);
do seq = 1 to rand('integer', 35);
output;
end;
end;
run;
proc surveyselect noprint data=have sampsize=5 out=want;
cluster id;
run;
proc sql noprint;
select count(distinct id) into :id_count trimmed from want;
%put NOTE: &=id_count;
If you don't have the procedure as part of your SAS license, you can do sample selection with the k/n algorithm. NOTE: the earliest archived post for k/n is a May 1996 SAS-L message, which has code based on a 1995 SAS Observations magazine article.
proc sql noprint;
select count(distinct id) into :N trimmed from have;
proc sort data=have;
by id;
data want_kn;
retain N &N k 5;
if _n_ = 1 then call streaminit(123);
keep = rand('uniform') < k / N;
if keep then k = k - 1;
do until (last.id);
set have;
by id;
if keep then output;
end;
if k = 0 then stop;
N = N - 1;
drop k N keep;
run;
proc sql noprint;
select count(distinct id) into :id_count trimmed from want_kn;
%put NOTE: &=id_count;

How to use CASE-operator for generating different sequences as per condition given

I want to generate a sequence by matching the condition given. I have two sequences in the CASE condition, and depending on the test condition the query should generate the respective sequence. However, even though the output is correct, both sequences are being generated, resulting in a missed-sequence issue. Is there any way that only the successful test condition is executed? Below is the query used in Oracle DB.
select CASE
WHEN :x=7
THEN seq1.NEXTVAL
ELSE seq2.NEXTVAL
END output from dual;
Suppose I pass 7 as the x input: I get the next value of seq1 as output, which is correct; however, the next value for seq2 is also generated in the back end and is missed the next time that sequence is used.
I need this condition for auditing.
You already know what's going on with your code. See if this helps.
First, create both sequences:
SQL> create sequence seq1;
Sequence created.
SQL> create sequence seq2;
Sequence created.
Now, create two functions, one for each sequence:
SQL> create or replace function f1 return number as begin return seq1.nextval; end;
2 /
Function created.
SQL> create or replace function f2 return number as begin return seq2.nextval; end;
2 /
Function created.
Run the select statement several times: once with input value 7 and several times with other values. But don't select directly from the sequences; use the functions instead:
SQL> select case when &x = 7 then f1
2 else f2
3 end result
4 from dual;
Enter value for x: 7
old 1: select case when &x = 7 then f1
new 1: select case when 7 = 7 then f1
RESULT
----------
1
SQL> /
Enter value for x: 2
old 1: select case when &x = 7 then f1
new 1: select case when 2 = 7 then f1
RESULT
----------
1
SQL> /
Enter value for x: 3
old 1: select case when &x = 7 then f1
new 1: select case when 3 = 7 then f1
RESULT
----------
2
SQL> /
Enter value for x: 4
old 1: select case when &x = 7 then f1
new 1: select case when 4 = 7 then f1
RESULT
----------
3
SQL> /
Enter value for x: 5
old 1: select case when &x = 7 then f1
new 1: select case when 5 = 7 then f1
RESULT
----------
4
OK; let's now check the sequences' current values:
SQL> select seq1.currval, seq2.currval from dual;
CURRVAL CURRVAL
---------- ----------
1 4
Aha! They aren't the same as they would be using your code (i.e. with the sequences directly in the select statement). Therefore, this might be a workaround for your problem.
However, sequences aren't to be used if you want a gapless list of numbers. They will provide uniqueness, that's for sure, but you most probably can't avoid gaps.

Postgres timeline simulator

I want to order search results by (age group, rank), and have age groups of 1 day, 1 week, 1 month, 6 months etc. I know I can get the "days old" with
SELECT NOW()::DATE - created_at::DATE FROM blah
and am thinking of doing a CASE statement based on that, but am I barking up the right tree performance-wise? Is there a nicer way?
You can also create a separate table with interval definitions and labels. However, this comes at the cost of an extra join to get the data.
create table distance (
d_start int,
d_end int,
d_description varchar
);
insert into distance values
(1,7,'1 week'),
(8,30,'1 month'),
(31,180,'6 months'),
(181,365,'1 year'),
(366,999999,'more than one year')
;
with
sample_data as (
select *
from generate_series('2013-01-01'::date,'2014-01-01'::date,'1 day') created_at
)
select
created_at,
d_description
from
sample_data sd
join distance d on ((current_date-created_at::date) between d.d_start and d.d_end)
;
I am using this function to update an INT column stored on the table for performance reasons, and running an occasional update task. What's nice that way is that it's only necessary to run it against a small subset of the data once per hour (anything <~ 1 week old), and every 24 hours it can be run against anything > 1 week old (perhaps even a weekly task for even older stuff).
CREATE OR REPLACE FUNCTION age_group(_date timestamp) RETURNS int AS
$$
DECLARE
days_old int;
age_group int;
BEGIN
days_old := current_date - _date::DATE;
age_group := CASE
WHEN days_old < 2 THEN 0
WHEN days_old < 8 THEN 1
WHEN days_old < 30 THEN 2
WHEN days_old < 90 THEN 3
ELSE 4
END;
RETURN age_group;
END;
$$
LANGUAGE plpgsql;

Oracle/PLSQL. Select only best match between two tables

I have two tables:
Vehicles
make | model | modification
Audi | A 5   | A 5 2010 Sportsback 2.8
Audi | A 5   | A 5 2012 Quattro L
Audi | A 5   | A 5 Cabriolet
and
matchingModel
make | model | modContain | modEnd | finalModel
Audi | A 5   | Sportback  |        | A5 Sportback
Audi | A 5   |            | L      | A5 L
Audi | A 5   |            |        | A5
My task is to get only the best-fitting finalModel by finding matches (as can be seen in the select below).
First I tried to join the tables:
(SELECT
matchingModel.finalModel
FROM vehicles
LEFT OUTER JOIN matchingModel ON
matchingModel.TEXT1 = vehicles.make
AND vehicles.model = nvl(matchingModel.model,vehicles.model)
AND vehicles.modification LIKE decode(matchingModel.modContain, NULL, vehicles.modification, '%'||matchingModel.modContain||'%')
AND vehicles.modification LIKE decode(matchingModel.modEnd, NULL, vehicles.modification, '%'||' '||matchingModel.modEnd)
)
AS bestMatch
but that did not work, because although Sportsback was found as sportsback, it was later overwritten with a simple A5, because that matches too.
So next I made this happen simply by "nvl-ing" all possible options: nvl(nvl(nvl(select where make and model fit and modContain is in the middle of modification and the other cell is empty), (select where make and model fit and modEnd is like the ending of modification and modEnd is not empty), (select where make and model fit AND so on)) AS Bestmatch
This works, but it is very slow (and both tables have more than 500k records).
This is just a part of a very large select, so it is difficult to rewrite it the normal way.
Anyway, the question is: are there any best practices for getting the best match, only once, fast, in Oracle? The problems I have run into are performance, values that fit twice, or a "where" clause that does not work because I cannot know whether modContain or modEnd is empty.
Thank You in advance.
Sorry for my English.
It is not quite there yet but I worked out an example you can continue to work out for yourself: SQL Fiddle Demo
select * from (
(select
case when v.modification like '%'||m.modContain||'%' then 2
when m.modcontain is null then 1
else 0 end m1,
case when v.modification like '%'||m.modend then 2
when m.modend is null then 1
else 0 end m2
, m.make mmake, m.model mmodel, modcontain, modend, finalmodel
, v.make vmake, v.model vmodel, modification
from vehicles v, matchingmodel m
where
v.make = m.make
and soundex(v.model) = soundex(m.model) ) ) x
order by m1+m2 desc
So the sub-query adds together the matches, and the highest total should be your best match. I also used soundex, which may help you because Sportback and Sportsback are not quite the same, and it helped me make A5 and A 5 the same. Also, to make it fast you will have to work a lot on assigning good indices and watching the explain plan, especially if you have 500k records. That is not an easy undertaking.
As for the idea of writing a procedure (which is a good idea), untested, it might look like this:
create or replace function vehicle_matching(i_vehicles vehicles%rowtype,
i_matchingmodel matchingmodel%rowtype)
return number
is
l_return number;
begin
if i_vehicles.modification like '%'||i_matchingmodel.modContain||'%' then
l_return := 3;
elsif soundex(i_vehicles.modification) like '%'||soundex(i_matchingmodel.modContain)||'%' then
l_return := 2;
...
if i_vehicles.modification like '%'||i_matchingmodel.modend then
l_return := l_return + 1; -- there is no i++ in PL/SQL
elsif
...
return l_return;
end vehicle_matching;
I was also wondering whether it is more efficient to work with INSTR and SUBSTR than with the % wildcards, but I do not really think that is the case.
You may consider something like this:
- write a query to return 1 on any partial match
- then write another query to return another 1 on another partial match, etc.
- repeat this for all possible columns that count towards your 'similarity'
In the end, you will find the row with the highest sum (or count) of 1's, and that will be the closest match.
