Assign a consistent random number to id in SAS across datasets

I have two datasets data1 and data2 with an id column. I want to assign a random number (rand_id) to each id, but this number needs to be consistent across datasets (rand_id for id=1 must be the same in both datasets). The objective is to get:
data1:
id   rand_id
1    0.4212
2    0.5124
3    0.1231

data2:
id   rand_id
1    0.4212
3    0.1231
2    0.5124
4    0.9102
Note that ids do not need to be ordered, and some ids might appear in one dataset but not in the other. I thought
DATA data1;
SET data1;
CALL STREAMINIT(id);
rand_id=RAND('uniform');
RUN;
and the same for data2 would do the job, but it does not. It just takes the first id as the seed and generates a sequence of random numbers from it.
From the STREAMINIT documentation, it seems it's only called once per data step. I'd like it to be called on every row. Is this possible?

The idea is to create a table random_values with an associated random id for each id that we later join on the two tables.
*assign random seed;
%let random_seed = 71514218;
*list of unique id;
proc sql;
create table unique_id as
select distinct id
from (
select id from have1
union all
select id from have2
)
;
quit;
*add random values;
data random_values;
set unique_id;
call streaminit(&random_seed.);
rand = rand('uniform', 0, 1);
run;
*join back on have1;
proc sql;
create table have1 as
select t1.id, t2.rand as rand_id
from have1 t1 left join random_values t2
on t1.id = t2.id
;
quit;
*join back on have2;
proc sql;
create table have2 as
select t1.id, t2.rand as rand_id
from have2 t1 left join random_values t2
on t1.id = t2.id
;
quit;
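The same lookup-table idea can be sketched in Python for intuition (row-dict datasets and names are hypothetical; the seed mirrors the macro variable above):

```python
import random

def assign_rand_ids(*datasets, seed=71514218):
    """Build one id -> random-value lookup and merge it into every dataset."""
    rng = random.Random(seed)
    # union of ids across all datasets, like the UNION ALL + DISTINCT step
    unique_ids = sorted({row["id"] for ds in datasets for row in ds})
    lookup = {i: rng.random() for i in unique_ids}
    for ds in datasets:                     # the "join back" step
        for row in ds:
            row["rand_id"] = lookup[row["id"]]
    return lookup

data1 = [{"id": 1}, {"id": 2}, {"id": 3}]
data2 = [{"id": 1}, {"id": 3}, {"id": 2}, {"id": 4}]
assign_rand_ids(data1, data2)
```

rand_id for a given id is identical in both datasets, because both pull from the same lookup.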

Why not use a lookup dataset? You could create and update it using a hash object.
First make an empty dataset:
data rand_id;
set one(keep=id);
rand_id=.;
stop;
run;
Then process the first dataset, adding the new RAND_ID variable to it and also populating the RAND_ID dataset with all of the unique ID values.
data one_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set one end=eof;
if h.find() then do;
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;
Repeat for any other datasets that share the same ID variable.
data two_random;
if _n_=1 then do;
declare hash h(dataset:'rand_id');
rc=h.definekey('id');
rc=h.definedata('id','rand_id');
rc=h.definedone();
end;
if eof then rc=h.output(dataset:'rand_id');
set two end=eof;
if h.find() then do;
rand_id=rand('uniform');
rc=h.add();
end;
drop rc;
run;
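The hash-object pattern maps naturally onto a dictionary that persists between passes. A minimal Python sketch of the same logic (names hypothetical):

```python
import random

def add_rand_ids(dataset, lookup, rng):
    """Analogue of the hash step: look up each id, adding a new random
    value for ids not seen before, so the lookup grows across datasets."""
    out = []
    for row in dataset:
        if row["id"] not in lookup:   # h.find() failed -> rand() + h.add()
            lookup[row["id"]] = rng.random()
        out.append({**row, "rand_id": lookup[row["id"]]})
    return out

lookup = {}                           # plays the role of the RAND_ID dataset
rng = random.Random()
one_random = add_rand_ids([{"id": 1}, {"id": 2}], lookup, rng)
two_random = add_rand_ids([{"id": 2}, {"id": 3}], lookup, rng)
```

Each later dataset reuses the values already in the lookup and only draws new randoms for unseen ids.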

Simplest way to do this in my opinion is to create a format dataset. Tom's hash example is fine also, but this is probably easier if you don't know hash tables.
Do NOT seed the random number from the ID itself - this is not random anymore.
data forfmt;
set data1;
call streaminit(7);
label = put(rand('Uniform'),12.9);
start = id;
fmtname = 'RANDIDF';
output;
if _n_ eq 1 then do;
hlo='o';
label='.';
output;
end;
run;
proc format cntlin=forfmt;
quit;
Then you can use put(id, randidf.) to assign the random ID. If you want the result to be numeric, use input instead of put and make it an informat: set type='i'; in the control dataset, and note that the input must be character (or turned into character via put). No sorting required, and the lookup is very fast most of the time.

Solved:
DATA data1;
SET data1;
seed = id;
CALL RANUNI(seed,rand_id);
DROP seed;
RUN;
Generates the desired result.
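The accepted trick works because RANUNI, called with the id as its seed variable, maps each id deterministically to the same value in every dataset. A rough Python analogue (not the same random stream as SAS RANUNI, just the same idea):

```python
import random

def rand_id(i):
    # re-seed a fresh generator from the id, like: seed = id; call ranuni(seed, rand_id)
    return random.Random(i).random()

# the same id yields the same value, no matter which dataset it appears in
data1 = {i: rand_id(i) for i in [1, 2, 3]}
data2 = {i: rand_id(i) for i in [1, 3, 2, 4]}
```

Keep the earlier caveat in mind: values derived directly from the ids are deterministic, and for sequential ids they may be less uniformly distributed than draws from a single seeded stream.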

Related

Randomly select 10 subjects and retain all of their observations

I am stuck with the following problem in SAS. I have a dataset of this format:
The dataset consists of 500 ids with a different number of observations per id. I'm trying to randomly select 5 ids and at the same time retain all of their observations. I built a random generator in the first place, saving a vector of 10 numbers in the interval [1,500]. However, it became clumsy when I tried to use this vector to select the ids corresponding to the random numbers. To be clear, I want my net result to be a dataset which includes all observations corresponding to IDs 1, 10, 43, 22, 67, or any other sequence of 5 numbers.
Any tip will be more than appreciated!
From your question, I assume you already have your 10 random numbers. If they are saved in a table/dataset, you can run a left join between them and your original dataset, by id. This will pull out all the original observations with the same id.
Let's say that your randomly selected numbers are saved in a table called "random_ids". Then, you can do:
proc sql;
create table want as
select distinct
t1.id,
t2.*
from random_ids as t1
left join have as t2 on t1.id = t2.id;
quit;
If your random numbers are not saved in a dataset, you may simply copy them into a WHERE clause, like:
proc sql;
create table want as
select distinct *
from have
where id in (1, 10, 43, 22, 67) /*here you put the ids you want*/
;
quit;
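The same join-or-filter logic in Python, for intuition (row dicts hypothetical):

```python
# keep every observation whose id is in the randomly selected set
have = [{"id": 1, "y": "a"}, {"id": 2, "y": "b"},
        {"id": 10, "y": "c"}, {"id": 10, "y": "d"}]
selected_ids = {1, 10}   # your sampled ids, however they were drawn
want = [row for row in have if row["id"] in selected_ids]
```

All observations for each selected id are retained, because the filter is on id rather than on individual rows.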
Proc SURVEYSELECT is your friend.
data have;
call streaminit(123);
do _n_ = 1 to 500;
id = rand('integer', 1e6);
do seq = 1 to rand('integer', 35);
output;
end;
end;
run;
proc surveyselect noprint data=have sampsize=5 out=want;
cluster id;
run;
proc sql noprint;
select count(distinct id) into :id_count trimmed from want;
%put NOTE: &=id_count;
If you don't have the procedure as part of your SAS license, you can do sample selection per k/n algorithm. NOTE: Earliest archived post for k/n is May 1996 SAS-L message which has code based on a 1995 SAS Observations magazine article.
proc sql noprint;
select count(distinct id) into :N trimmed from have;
proc sort data=have;
by id;
data want_kn;
retain N &N k 5;
if _n_ = 1 then call streaminit(123);
keep = rand('uniform') < k / N;
if keep then k = k - 1;
do until (last.id);
set have;
by id;
if keep then output;
end;
if k = 0 then stop;
N = N - 1;
drop k N keep;
run;
proc sql noprint;
select count(distinct id) into :id_count trimmed from want_kn;
%put NOTE: &=id_count;
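The k/n pass above can be sketched in Python (illustrative; it selects exactly k groups in a single ordered scan):

```python
import random

def sample_k_groups(rows, key, k, seed=123):
    """Keep each group with probability (groups still needed) / (groups left),
    which yields exactly k groups, uniformly at random, in one pass."""
    rng = random.Random(seed)
    group_ids = sorted({row[key] for row in rows})
    n = len(group_ids)
    chosen = set()
    for gid in group_ids:
        if rng.random() < k / n:
            chosen.add(gid)
            k -= 1
            if k == 0:
                break
        n -= 1
    return [row for row in rows if row[key] in chosen]

# 500 ids with 3 observations each (hypothetical data)
have = [{"id": i, "seq": s} for i in range(1, 501) for s in range(1, 4)]
want = sample_k_groups(have, "id", 5)
```

When the remaining groups exactly equal the number still needed, k/n reaches 1 and every remaining group is kept, so the sample size is guaranteed.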

SAS join (or insert) little table to big table

I have a little problem. I have a big table and a few little tables, where the little tables contain a subset of the big table's fields. How can I insert (or union) the tables so that when a field exists in both, the data is copied, and when a little table lacks a field from the big one, the big table gets null/0?
Example:
data temp1;
infile DATALINES dsd missover;
input a b c d e f g;
CARDS;
1, 2, 3, 4,5,6
2, 3, , 5
3, 3
4,,3,2,3,
;
run;
data temp2;
infile DATALINES dsd missover;
input a c e g;
CARDS;
5, 2, 3, 4
6, 3, , 5
7, 3
;
run;
Is there an elegant method where, if I insert temp2 into temp1, the fields missing from temp2 are set to null in the result?
Thank you for help!
That is exactly what SAS does by default.
data want ;
set have1 have2;
run;
It will match the variables by name, and any variables that do not exist in either source will have missing values.
For better performance when appending a small table to a large table, you should use PROC APPEND instead of a data step, to avoid making a new copy of the large dataset. That is more like an "insert". The FORCE option allows the datasets' structures to differ. But since the new data is being added to the old dataset, any extra variables that appear only in HAVE2 will just be ignored and their values will be lost.
proc append base=have1 data=have2 force ;
run;
If you really did have to generate an actual INSERT statement (perhaps you are actually trying to generate SQL code to run in a foreign database) you might want to compare the metadata of the two datasets and find the common variables.
proc contents data=have1 out=cont1 noprint; run;
proc contents data=have2 out=cont2 noprint; run;
proc sql noprint;
select a.name into :varlist separated by ','
from cont2 a
inner join cont1 b
on upcase(a.name) = upcase(b.name)
;
...
insert into have1 (&varlist) select &varlist from have2 ;
It is not very clear to me what operation you intend to do but some initial thoughts are:
To compare columns between two datasets (and check whether a value exists in one of them) it is good practice to use an outer join. You can do joins via a MERGE statement in a data step, or more elegantly with PROC SQL.
However, using either approach you will have to specify which two rows in temp1 and temp2 shall be compared - you are typically joining on a column that is available in both tables.
To help us resolve your issue, could you possibly provide the correct output for your desired operation, if you perform it on temp1 and temp2? This would show what options you've explored and what needs to be fixed there.
You should try PROC APPEND. It will be more efficient, because it does not re-read your big table again and again, unlike:
/*reads temp1 which is big table and temp2*/
data temp3;
set temp1 temp2;
run;
/* this does pretty much same as above code but will not read your big table
and will be efficient*/
proc append base=temp1 data=temp2 force;
run;
more on proc append in documentation http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n19kwc3onglzh2n1l2k4e39edv3x.htm

How to add running ID in a single UPDATE statement (Oracle)

let's assume I have a table tab1 in my Oracle DB 12.1, which has a column record_id (type NUMBER) and many other columns, among them a column named exchg_id.
This record_id is always empty when a batch of new rows gets inserted into the table. What I need to do is to populate record_id with values 1..N for all rows that satisfy a condition ...WHERE EXCHG_ID = 'something', where N is the number of such rows. Of course I know how to do this procedurally (in a for-loop), but I'd like to know if there's a faster way using a single UPDATE statement. I imagine something like this:
UPDATE tab1 SET record_id = {1..N} WHERE exchg_id = 'something';
Many thanks for your help!
UPDATE: the order of the rows is not important, I need no specific ordering. I just need unique record_id's 1..N for any given exchg_id.
You could use rownum to set record_id to 1 to N :
UPDATE tab1 SET record_id = rownum WHERE exchg_id = 'something';
If you have some offset, say 10, then use rownum + 10
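For intuition, here is the same rownum-style numbering in application code: matching rows get 1..N in arbitrary scan order (a Python sketch over hypothetical row dicts):

```python
def assign_record_ids(rows, exchg_id, offset=0):
    """Give each row matching exchg_id a unique record_id 1..N (plus offset)."""
    n = offset
    for row in rows:
        if row["exchg_id"] == exchg_id:
            n += 1                 # like rownum: order is arbitrary, ids are unique
            row["record_id"] = n
    return n - offset              # N, the number of rows updated

tab1 = [{"exchg_id": "a", "record_id": None},
        {"exchg_id": "b", "record_id": None},
        {"exchg_id": "a", "record_id": None}]
assign_record_ids(tab1, "a")
```

As with rownum, no particular ordering is promised, which matches the requirement that only uniqueness of 1..N matters.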

Pick random observation for each by group in SAS

I have two datasets: form and pool having similar table structure.
a) Dataset form
b) Dataset pool
The above two datasets have three columns, where each Key and Level combination in the form dataset has 4 rows.
Also, the Sub-Level data in form and pool are mutually exclusive at the Key and Level level.
Sub-Level values have a particular Level assigned to them, i.e. Level and Sub-Level follow a hierarchy.
I want to populate the null values in the Sub-Level column of the form dataset with Sub-Level values from the pool dataset, where a Sub-Level in pool should belong to the same Key and Level as in form.
How can this be done in SAS?
EDIT 1 : null values that are populated in form dataset must be distinct or unique within each by group of key, level.
Create a new row ID in form: key, level, new_id.
Randomly sort pool within each group and create the same id.
Left join by key, level, new_id and replace the missing values.
As there are fewer or equal values per group in form compared to pool, and all values are disjoint and unique, you will complete form without generating duplicates.
Code example:
data form;
set form;
by key level; /* first.level requires a BY statement; assumes form is sorted by key and level */
retain new_id;
if first.level then new_id = 0;
new_id + 1;
run;
data pool;
set pool;
ran_num = ranuni(12345); /* generate random uniform number */
run;
proc sort data=pool; by key level ran_num; run; /* sort by random number */
data pool;
set pool;
by key level; /* first.level requires a BY statement */
retain new_id;
if first.level then new_id = 0;
new_id + 1;
run;
proc sql;
create table form_full as
select a.key, a.level, coalescec(a.sub_level,b.sub_level) as sub_level
from form a
left join pool b
on a.key eq b.key and a.level eq b.level and a.new_id eq b.new_id
;
quit;
edit:
In case pool has fewer rows in a group than form,
change new_id in form:
data form;
set form;
by key level;
retain new_id;
if first.level then new_id = 0;
if missing(sub_level) then new_id + 1; /* number only the rows that need filling */
run;
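The whole fill procedure can be sketched in Python: shuffle each pool group, then pop one value for every missing slot (the row layout is hypothetical):

```python
import random
from collections import defaultdict

def fill_sub_levels(form, pool, seed=12345):
    """Fill missing sub_level values in form from the matching (key, level)
    group in pool, in random order and without reusing a pool value."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in pool:
        groups[(row["key"], row["level"])].append(row["sub_level"])
    for values in groups.values():
        rng.shuffle(values)                  # random sort within the group
    for row in form:
        if row["sub_level"] is None:         # only rows that need filling
            row["sub_level"] = groups[(row["key"], row["level"])].pop()
    return form

form = [{"key": 1, "level": "A", "sub_level": "a1"},
        {"key": 1, "level": "A", "sub_level": None},
        {"key": 1, "level": "A", "sub_level": None}]
pool = [{"key": 1, "level": "A", "sub_level": "a2"},
        {"key": 1, "level": "A", "sub_level": "a3"}]
fill_sub_levels(form, pool)
```

Because pool values are popped, no value is assigned twice within a group, matching the distinctness requirement in EDIT 1.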

How to put more than 1000 values into an Oracle IN clause [duplicate]

This question already has answers here:
SQL IN Clause 1000 item limit
(5 answers)
Closed 8 years ago.
Is there any way to get around the Oracle 10g limitation of 1000 items in a static IN clause? I have a comma-delimited list of many IDs that I want to use in an IN clause. Sometimes this list can exceed 1000 items, at which point Oracle throws an error. The query is similar to this...
select * from table1 where ID in (1,2,3,4,...,1001,1002,...)
Put the values in a temporary table and then do a select where id in (select id from temptable)
This multi-column trick sidesteps the limit, because the 1000-item cap (ORA-01795) applies only to lists of single values, not to multi-column expression lists such as the ('magic', n) pairs below:
select column_X, ... from my_table
where ('magic', column_X ) in (
('magic', 1),
('magic', 2),
('magic', 3),
('magic', 4),
...
('magic', 99999)
) ...
I am almost sure you can split values across multiple INs using OR:
select * from table1 where ID in (1,2,3,4,...,1000) or
ID in (1001,1002,...,2000)
You may try to use the following form:
select * from table1 where ID in (1,2,3,4,...,1000)
union all
select * from table1 where ID in (1001,1002,...)
Where do you get the list of ids from in the first place? Since they are IDs in your database, did they come from some previous query?
When I have seen this in the past it has been because:-
a reference table is missing and the correct way would be to add the new table, put an attribute on that table and join to it
a list of ids is extracted from the database, and then used in a subsequent SQL statement (perhaps later or on another server or whatever). In this case, the answer is to never extract it from the database. Either store in a temporary table or just write one query.
I think there may be better ways to rework this code that just getting this SQL statement to work. If you provide more details you might get some ideas.
Use ...from table(... :
create or replace type numbertype
as object
(nr number(20,10) )
/
create or replace type number_table
as table of numbertype
/
create or replace procedure tableselect
( p_numbers in number_table
, p_ref_result out sys_refcursor)
is
begin
open p_ref_result for
select *
from employees , (select /*+ cardinality(tab 10) */ tab.nr from table(p_numbers) tab) tbnrs
where id = tbnrs.nr;
end;
/
This is one of the rare cases where you need a hint, else Oracle will not use the index on column id. One of the advantages of this approach is that Oracle doesn't need to hard-parse the query again and again. Using a temporary table is usually slower.
edit 1 simplified the procedure (thanks to jimmyorr) + example
create or replace procedure tableselect
( p_numbers in number_table
, p_ref_result out sys_refcursor)
is
begin
open p_ref_result for
select /*+ cardinality(tab 10) */ emp.*
from employees emp
, table(p_numbers) tab
where tab.nr = id;
end;
/
Example:
set serveroutput on
create table employees ( id number(10),name varchar2(100));
insert into employees values (3,'Raymond');
insert into employees values (4,'Hans');
commit;
declare
l_number number_table := number_table();
l_sys_refcursor sys_refcursor;
l_employee employees%rowtype;
begin
l_number.extend;
l_number(1) := numbertype(3);
l_number.extend;
l_number(2) := numbertype(4);
tableselect(l_number, l_sys_refcursor);
loop
fetch l_sys_refcursor into l_employee;
exit when l_sys_refcursor%notfound;
dbms_output.put_line(l_employee.name);
end loop;
close l_sys_refcursor;
end;
/
This will output:
Raymond
Hans
I wound up here looking for a solution as well.
Depending on the high-end number of items you need to query against, and assuming your items are unique, you could split your query into batched queries of up to 1000 items and combine the results on your end instead (sketched here in Python):
# remove dupes, preserving order
items = list(dict.fromkeys(items))
# break the items into 1000-item batches
batches = [items[i:i + 1000] for i in range(0, len(items), 1000)]
# now go query the db for each batch and combine the results
results = []
for batch in batches:
    results.extend(query(batch))
This may be a good trade-off in the scenario where you don't typically have over 1000 items, as having over 1000 items would be your "high end" edge-case scenario. For example, in the event that you have 1500 items, two queries of (1000, 500) wouldn't be so bad. This also assumes that each query isn't particularly expensive in its own right.
This wouldn't be appropriate if your typical number of expected items got to be much larger - say, in the 100000 range - requiring 100 queries. If so, then you should probably look more seriously into using the global temporary tables solution provided above as the most "correct" solution. Furthermore, if your items are not unique, you would need to resolve duplicate results in your batches as well.
Yes, it is a very weird situation for Oracle.
If you specify 2000 ids inside the IN clause, it will fail.
This fails:
select ...
where id in (1,2,....2000)
But if you simply put the 2000 ids in another table (a temp table, for example), the query below works:
select ...
where id in (select userId
from temptable_with_2000_ids )
Alternatively, you could split the records into lots of 1000 and execute them group by group.
Here is some Perl code that tries to work around the limit by creating an inline view and then selecting from it. The statement text is compressed by using rows of twelve items each instead of selecting each item from DUAL individually, then uncompressed by unioning together all columns. UNION or UNION ALL in decompression should make no difference here as it all goes inside an IN which will impose uniqueness before joining against it anyway, but in the compression, UNION ALL is used to prevent a lot of unnecessary comparing. As the data I'm filtering on are all whole numbers, quoting is not an issue.
#
# generate the innards of an IN expression with more than a thousand items
#
use English '-no_match_vars';
sub big_IN_list{
@_ < 13 and return join ', ', @_;
my $padding_required = (12 - (@_ % 12)) % 12;
# get first dozen and make length of @_ an even multiple of 12
my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l) = splice @_,0,12, ( ('NULL') x $padding_required );
my @dozens;
local $LIST_SEPARATOR = ', '; # how to join elements within each dozen
while(@_){
push @dozens, "SELECT @{[ splice @_,0,12 ]} FROM DUAL"
};
$LIST_SEPARATOR = "\n union all\n "; # how to join @dozens
return <<"EXP";
WITH t AS (
select $a A, $b B, $c C, $d D, $e E, $f F, $g G, $h H, $i I, $j J, $k K, $l L FROM DUAL
union all
@dozens
)
select A from t union select B from t union select C from t union
select D from t union select E from t union select F from t union
select G from t union select H from t union select I from t union
select J from t union select K from t union select L from t
EXP
}
}
One would use that like so:
my $bases_list_expr = big_IN_list(list_your_bases());
$dbh->do(<<"UPDATE");
update bases_table set belong_to = 'us'
where id in ($bases_list_expr)
UPDATE
Instead of using an IN clause, can you try using a JOIN with the other table that holds the ids? That way we don't need to worry about the limit. Just a thought from my side.
Instead of SELECT * FROM table1 WHERE ID IN (1,2,3,4,...,1000);
Use this :
SELECT * FROM table1 WHERE ID IN (SELECT rownum AS ID FROM dual connect BY level <= 1000);
*Note that you need to be sure the ID does not reference any other foreign IDs if there is such a dependency. To ensure only existing ids are used:
SELECT * FROM table1 WHERE ID IN (SELECT distinct(ID) FROM tablewhereidsareavailable);
