Is there a way to control the proportion of a variable's values when drawing a random sample in SAS?
Let's say that I have a table of 1000 people (500 male and 500 female).
If I draw a random sample of 100 with gender as the stratum, I will get 50 males and 50 females in my output.
I want to know whether there is a way to get a desired proportion of gender values instead.
Can I have a random sample of 100 with 70 males and 30 females?
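PROC SURVEYSELECT is the way to do this: pass a dataset to n= or samprate= instead of a number. The dataset has one row per stratum, with the sample size in _NSIZE_ (or the rate in _RATE_ for samprate=).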
data strata_to_Sample;
length sex $1;
input sex $ _NSIZE_;
datalines;
M 70
F 30
;;;;
run;
proc sort data=strata_To_sample;
by sex;
run;
data to_sample;
set sashelp.class;
do _i = 1 to 1e5;
output;
end;
run;
proc sort data=to_Sample;
by sex;
run;
proc surveyselect data=to_sample n=strata_to_sample out=sample;
strata sex;
run;
Generally that is what proc surveyselect is for.
But for a quick and dirty datastep solution:
data in_data;
do i= 1 to 500;
sex = 'M'; output;
sex = 'F'; output;
end;
run;
data in_data;
set in_data;
rannum = ranuni(12345);
run;
proc sort data= in_data; by rannum; run;
data sample_data;
set in_data;
retain count_m count_f 0;
if sex = 'M' and count_m lt 70 then do; count_m + 1; output; end;
else if sex = 'F' and count_f lt 30 then do; count_f + 1; output; end;
run;
proc freq data= sample_data;
table sex;
run;
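For readers who want to check the logic of the data step approach outside SAS, here is a minimal Python sketch of the same idea: shuffle the rows, then keep each row only while its stratum's quota is not yet filled. The function name `quota_sample` and the quota dictionary are illustrative assumptions, not part of the original answer.

```python
import random
from collections import Counter

def quota_sample(rows, key, quotas, seed=12345):
    """Shuffle rows, then keep each row while its stratum quota is not yet full."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    counts = Counter()
    sample = []
    for row in shuffled:
        stratum = row[key]
        if counts[stratum] < quotas.get(stratum, 0):
            counts[stratum] += 1
            sample.append(row)
    return sample

# 500 males and 500 females, as in the question
population = [{"id": i, "sex": "M" if i % 2 == 0 else "F"} for i in range(1000)]
sample = quota_sample(population, "sex", {"M": 70, "F": 30})
print(Counter(row["sex"] for row in sample))
```

Because the rows are in random order before the quotas are applied, each stratum's picks are a simple random sample of that stratum, which is what the sorted-by-rannum data step relies on as well.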
I need to use the SAS random number function RAND() and a DO...END loop to create 100 observations in a variable named X. Then I want to use another DO loop of 500 rounds to generate a total of 500 samples, each with 100 observations. Each sample is drawn from a standard normal distribution.
I tried the following code but it does not give me what I need:
data A;
call streaminit(123); /* set random number seed */
do i = 1 to 100;
X = rand("Normal"); /* random number generator */
output;
end;
do r = 1 to 500 ;
if i then X = rand("Normal");
output;
end;
run;
Any input will be greatly appreciated.
Perfect time to use PROC IML:
proc iml;
call streaminit(123); /* set seed */
x = j(500, 100); /* allocate 500 by 100 matrix */
call randgen(x, "Normal"); /* fill matrix with N(0,1) random draws */
create mydata from x; /* move matrix to a dataset in the work directory */
append from x;
close mydata;
quit;
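Here is a data step solution:
data want;
call streaminit(123); /* set seed so the draws are reproducible */
do I=1 to 500; /* sample number */
do _iorc_=1 to 100; /* observation within the sample */
X=rand("normal"); /* one N(0,1) draw */
output;
end;
end;
run;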
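The nested-loop structure is the whole trick: the outer loop indexes the sample and the inner loop produces the observations. As a language-neutral check of that shape, here is a small Python sketch. The function name `simulate_samples` and its parameters are assumptions made for illustration.

```python
import random

def simulate_samples(n_samples=500, n_obs=100, seed=123):
    """Return n_samples lists, each holding n_obs standard normal draws."""
    rng = random.Random(seed)     # seeded so the draws are reproducible
    return [
        [rng.gauss(0.0, 1.0) for _ in range(n_obs)]   # inner loop: observations
        for _ in range(n_samples)                     # outer loop: samples
    ]

samples = simulate_samples()
print(len(samples), len(samples[0]))
```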
How to create a function that returns a float(ChargeTotal)?
ChargeTotal is based on a progressive table using number of batches.
num of batches | charge
----------------------------
1-10 | 0
11-20 | 50
21-30 | 60
31-40 | 70
40+ | 80
If number of batches is 25 then
num of batches | charge
----------------------------
1-10 | 0
11-20 | 50*10
21-30 | 60*5
----------------------------
total | 800 <number I need to be returned(ChargeTotal)
So far I have come up with the following, but I'm unsure how to get the total for each loop, or whether it is even possible to use more than one FOR statement:
CREATE OR REPLACE FUNCTION ChargeTotal
RETURN FLOAT IS
total FLOAT;
BEGIN
FOR a in 1 .. 10 LOOP
FOR a in 11 .. 20 LOOP
FOR a in 21 .. 30 LOOP
FOR a in 40 .. 1000 LOOP
RETURN Total;
END ChargeTotal;
OK, so take into consideration that right now I have no DB available to test this (there might be some syntax errors, etc.).
But I am thinking of something along these lines:
-- progressive_table is assumed to hold one row per band (lowLimit, highLimit, charge);
-- the open-ended top band needs a large highLimit
create or replace function ChargeTotal(xin number) return number is
cursor mycursor is
select lowLimit,highLimit,charge
from progressive_table order by lowLimit;
total number := 0;
remaining number := xin; -- batches not yet charged
segment number;
begin
for i in mycursor loop
exit when remaining <= 0;
-- batches that actually fall into this band
segment := least(remaining, i.highLimit-i.lowLimit+1);
total := total + segment*i.charge;
remaining := remaining - segment;
end loop;
return total;
end;
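The band-by-band arithmetic can be checked outside the database. The Python sketch below walks the bands in order and charges only the batches that fall into each one. The band list is hard-coded for illustration, and the top band is assumed to start at 41 and be open-ended (the question writes it as "40+").

```python
def charge_total(batches, bands):
    """Progressive charge: each band's rate applies only to the batches inside it."""
    total = 0
    remaining = batches
    for low, high, charge in bands:
        if remaining <= 0:
            break
        # An open-ended band (high is None) absorbs everything that is left
        width = remaining if high is None else high - low + 1
        used = min(remaining, width)
        total += used * charge
        remaining -= used
    return total

# (low, high, charge per batch); the last band is an assumption, see above
BANDS = [(1, 10, 0), (11, 20, 50), (21, 30, 60), (31, 40, 70), (41, None, 80)]

print(charge_total(25, BANDS))  # 10*0 + 10*50 + 5*60 = 800
```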
I think you can do the calculation in a single SQL statement, without a complex function.
The logic is:
you have a weight (charge) for each "band"
calculate the band of each row
use count(*) over to count the rows in each band
join to your weight table to get the subtotal for each band
use rollup to get the grand total
sql
select r.num_of_batches
,sum(r.subtotal_charge)
from (
with weights as
(select 1 as num_of_batches, 0 as charge from dual
union all
select 2 as num_of_batches, 50 as charge from dual
union all
select 3 as num_of_batches, 60 as charge from dual
union all
select 4 as num_of_batches, 70 as charge from dual
union all
select 5 as num_of_batches, 80 as charge from dual
)
select distinct n.num_of_batches
, w.charge
, count(*) over (partition by n.num_of_batches) as cnt
, count(*) over (partition by n.num_of_batches) * charge as subtotal_charge
from (
select num, case when floor(num / 10) > 4 then 5 else floor(num / 10)+1 end as num_of_batches
from tst_brances b
) n
inner join weights w on n.num_of_batches = w.num_of_batches
order by num_of_batches
) r
group by ROLLUP(r.num_of_batches)
populate test data
create table tst_brances(num int);
declare
i int;
begin
delete from tst_brances;
for i in 1..10 loop
insert into tst_brances(num) values (i);
end loop;
for i in 11..20 loop
insert into tst_brances(num) values (i);
end loop;
for i in 21..25 loop
insert into tst_brances(num) values (i);
end loop;
for i in 31..32 loop
insert into tst_brances(num) values (i);
end loop;
for i in 41..43 loop
insert into tst_brances(num) values (i);
end loop;
for i in 51..55 loop
insert into tst_brances(num) values (i);
end loop;
commit;
end;
results
1 1 0
2 2 500
3 3 360
4 4 140
5 5 640
6 1640
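To see where these numbers come from, the sketch below repeats the query's banding in Python on the same test values. It uses the same expression as the CASE in the query, floor(num / 10) + 1 capped at 5, so a value of 10 lands in band 2 rather than band 1. The names `band_of` and `subtotals` are illustrative assumptions.

```python
from collections import Counter

CHARGE_PER_BAND = {1: 0, 2: 50, 3: 60, 4: 70, 5: 80}

def band_of(num):
    # Same expression as the CASE in the query: floor(num / 10) + 1, capped at 5
    return min(num // 10 + 1, 5)

def subtotals(nums):
    """Count rows per band and multiply by that band's charge."""
    counts = Counter(band_of(n) for n in nums)
    return {band: counts[band] * CHARGE_PER_BAND[band] for band in sorted(counts)}

# Same values as the test data: 1..25, 31..32, 41..43, 51..55
nums = list(range(1, 26)) + [31, 32] + [41, 42, 43] + list(range(51, 56))
per_band = subtotals(nums)
print(per_band, sum(per_band.values()))
```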
I hope you guys are well.
DATA: The input data is unsorted and hence I am using hash tables to take the input data, do some iterations, sort and then output. Sorting the original table prior to any iterations (using proc sort) would be a time-consuming effort. If there is no other option, then I will need to sit down for the gruesome sorting approach.
What I want: I am trying to enumerate a table variable "answer" with binary values (0/1) if variable filter = "Y" for the next 6 month observations with the same client. In some instances, the client is missing from some monthly observations eg: client FG5151 is missing from September and October 2006. In short if variable filter "Y" then this observation and the next 6 months observations for same client should be assigned variable "answer" eq 1, else 0.
data have;
input client $ dates date9. filter $;
datalines ;
Fg5151 28.Feb.06 N
Fg5151 31.Mar.06 N
Fg5151 30.Apr.06 N
Fg5151 31.May.06 Y
Fg5151 30.Jun.06 N
Fg5151 31.Jul.06 Y
Fg5151 31.Aug.06 N
Fg5151 30.Nov.06 N
Fg5151 31.Dec.06 N
Fg5151 01.Jan.07 N
A101 28.Feb.06 N
A101 31.Mar.06 N
A101 30.Apr.06 Y
A101 31.May.06 N
A101 30.Jun.06 N
A101 31.Jul.06 N
ABC123 31.Mar.06 N
;
data want;
input client $ dates date9. filter $ answer;
datalines ;
A101 28.Feb.06 N 0
A101 31.Mar.06 N 0
A101 30.Apr.06 Y 1
A101 31.May.06 N 1
A101 30.Jun.06 N 1
A101 31.Jul.06 N 1
ABC123 31.Mar.06 N 0
Fg5151 28.Feb.06 N 0
Fg5151 31.Mar.06 N 0
Fg5151 30.Apr.06 N 0
Fg5151 31.May.06 Y 1
Fg5151 30.Jun.06 N 1
Fg5151 31.Jul.06 Y 1
Fg5151 31.Aug.06 N 1
Fg5151 30.Nov.06 N 1
Fg5151 31.Dec.06 N 1
Fg5151 01.Jan.07 N 0
;
I have tried both a data step approach and a hash object approach, but I don't know how to approach this problem:
/* data step approach */
data want;
set have;
retain answer c;
if _n_=1 or lag(client) ne client then do;
answer=0;
c=0;
end;
if filter="Y" then do;
call symput('xdate',dates);
answer=1;
c=1;
end;
else if answer=1 then c=c+1;
if (intnx("month",dates,6,"same")) then do;
answer=0;
c=0;
end;
run;
/* hash method approach */
data _null_;
set have end=last;
if _n_ = 1 then do;
length newdate 8 answer 8 c 8;
format newdate ddmmyy10.;
declare hash hs(ordered: "a",hashexp: 9);
hs.defineKey("client","dates");
hs.defineData("client","dates","filter","answer","c");
hs.defineDone();
end;
rc = hs.find();
by client dates notsorted;
if rc ne 0 then do;
retain answer c;
if _n_=1 or lag(client) ne client then do;
answer=0;
c=0;
end;
if filter="Y" then do;
answer=1;
c=1;
hs.add();
end;
else if answer=1 then c=c+1;
if (intnx("month",dates,6,"same")) then do;
answer=0;
c=0;
hs.replace();
end;
hs.replace();
end;
if last eq 1 then do;
hs.output(dataset:
"not_working");
end;
run;
Any help would be greatly appreciated.
thank you.
regards,
S
One option is PROC FORMAT. This has a sort in it, but only of the filter='Y' records, so hopefully that's minimal. The sort is actually unnecessary if you are confident your data is grouped (but not sorted) by client (i.e., you can skip it; it will not delete anything), and in fact, since the m option is used anyway (to avoid worrying about collisions), you can probably skip it regardless.
This is not necessarily super-fast, because it uses the PUTN function instead of the PUT function. You will have to see how it performs on larger datasets.
The idea here is that we construct a format that defines the range of 'Y' for each record, and uses the hlo='o' option to define the rest of the range as 'N'.
data for_fmt;
set have;
by client notsorted;
if filter='Y' then do;
start = dates;
end = intnx('Month',dates,5,'s');
hlo=' m';
fmtname=cats(client,'F');
label='Y';
output;
end;
if last.client then do;
fmtname=cats(client,'F');
call missing(of start end);
hlo='om';
label='N';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by fmtname start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set have;
answer = putn(dates,cats(client,'F'));
run;
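The range idea can be checked independently of SAS formats. The Python sketch below builds one date window per filter='Y' row and flags any row of the same client that falls inside a window. No sorting is needed. The helper names are assumptions, and the window length of 5 months mirrors the intnx(...,5,'s') call above. The rows are a subset of the question's data.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    y, m = divmod(d.year * 12 + (d.month - 1) + months, 12)
    day = min(d.day, calendar.monthrange(y, m + 1)[1])
    return date(y, m + 1, day)

def flag_windows(rows, months=5):
    """rows: (client, date, filter) tuples in any order. Returns 0/1 flags in input order."""
    windows = {}
    for client, d, flt in rows:
        if flt == "Y":
            windows.setdefault(client, []).append((d, add_months(d, months)))
    return [
        int(any(start <= d <= end for start, end in windows.get(client, [])))
        for client, d, _ in rows
    ]

rows = [
    ("Fg5151", date(2006, 5, 31), "Y"),
    ("Fg5151", date(2006, 6, 30), "N"),
    ("Fg5151", date(2006, 7, 31), "Y"),
    ("Fg5151", date(2006, 11, 30), "N"),
    ("Fg5151", date(2006, 12, 31), "N"),
    ("Fg5151", date(2007, 1, 1), "N"),
    ("A101", date(2006, 3, 31), "N"),
    ("A101", date(2006, 4, 30), "Y"),
    ("ABC123", date(2006, 3, 31), "N"),
]
flags = flag_windows(rows)
print(flags)
```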
Good afternoon,
I would like to define my parameters in my plot as opposed to generating a plot with all values.
For example, I want to show only the sale price of the data not exceeding $400,000. This syntax is not correct, but this is my attempt at it. Should I use the if, by, or where statement in this matter? Thank you!
proc sgplot data=mydata;
loess x = FirstFlrSF y = saleprice / group= OverallQual;
reg x = FirstFlrSF y = saleprice;
where saleprice =< 400000;
title "First Floor SF vs sales price"; run;
IF statements don't work in PROCs, but WHERE statements do. However, you have the comparison operator specified incorrectly: it's <= instead of =<. I always remember the order by saying it out loud: less than or equal to.
proc sgplot data=sashelp.class;
scatter x=height y=weight;
where age <= 15;
run;quit;
Alternatively, the condition can be applied as a WHERE= dataset option on the input dataset:
proc sgplot data=mydata (where =(saleprice <= 400000));
loess x = FirstFlrSF y = saleprice / group= OverallQual;
reg x = FirstFlrSF y = saleprice;
title "First Floor SF vs sales price"; run;
I have an excel file that I imported into SAS that contains 3 variables and 3 observations.
All values are numbers.
24 12 47
99 30 14
50 5 41
Is there a way I can code so that each row is sorted in ascending order?
Result would be:
12 24 47
14 30 99
5 41 50
I need to do this for several Excel files that contain a huge number of variables and observations.
Thank You.
The simple way is to use CALL SORTN which sorts across rows.
data have;
input a b c;
datalines;
24 12 47
99 30 14
50 5 41
;
run;
data have;
modify have;
call sortn(of _numeric_);
run;
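As a quick cross-check of the expected result, sorting across a row just means sorting each row's values independently of the other rows. A minimal Python sketch on the question's data (the function name `sort_rows` is an assumption):

```python
def sort_rows(rows):
    """Return a new table in which each row's values are in ascending order."""
    return [sorted(row) for row in rows]

rows = [[24, 12, 47], [99, 30, 14], [50, 5, 41]]
print(sort_rows(rows))  # [[12, 24, 47], [14, 30, 99], [5, 41, 50]]
```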
I would use a FCMP sort routine. FCMP functions and subroutines only allow temporary arrays to be passed to them for modification. So you have to assign the values into a temporary array, sort, and then reassign to the permanent variables.
Modify the code below for your number of columns and column names.
options cmplib=work.cmp;
proc fcmp outlib=work.cmp.fns;
subroutine qsort(arr[*],lo,hi);
outargs arr;
i = lo;
j = hi;
do while (i < hi);
pivot = arr[floor((lo+hi)/2)];
do while (i<=j);
do while (arr[i] < pivot);
i = i + 1;
end;
do while (arr[j] > pivot);
j = j - 1;
end;
if (i<=j) then do;
t = arr[i];
arr[i] = arr[j];
arr[j] = t;
i = i + 1;
j = j - 1;
end;
end;
if (lo < j) then
call qsort(arr,lo,j);
lo = i;
j = hi;
end;
endsub;
run;
quit;
data test;
input a b c;
datalines;
24 12 47
99 30 14
50 5 41
;
run;
%let ncol=3;
%let cols = a b c;
data sorted;
set test;
array vars[&ncol] &cols;
/*Only temporary arrays can be passed to FCMP functions*/
array tmp[&ncol] _temporary_;
/*Assign to tmp*/
do i=1 to &ncol;
tmp[i] = vars[i];
end;
/*Sort*/
call qsort(tmp,1,&ncol);
/*Put back sorted values*/
do i=1 to &ncol;
vars[i] = tmp[i];
end;
drop i;
run;
Though there's a package SAS/IML designed specifically for manipulations with matrices (where, I believe, this task would be trivial), it still can be done with SAS Base using a couple of PROCs wrapped into macro loop.
data raw;
input a b c;
datalines;
24 12 47
99 30 14
50 5 41
;
run;
proc transpose data=raw out=raw_t(drop=_:); run;
proc sql noprint;
select name into :vars separated by ' '
from sashelp.vcolumn
where libname='WORK' and memname='RAW_T';
quit;
%macro sort_rows;
%do i=1 %to %sysfunc(countw(&vars));
proc sort data=raw_t(keep=%scan(&vars,&i)) out=column;
by %scan(&vars,&i);
run;
data sortedrows;
%if &i>1 %then set sortedrows;;
set column;
run;
%end;
%mend sort_rows;
%sort_rows
proc transpose data=sortedrows out=sortedrows(drop=_:); run;
First, you transpose your original dataset.
Then you iterate through all columns (which were rows originally) one by one, sorting each one and placing it side by side with the columns already sorted.
Finally, you transpose everything back.