I'm following basic instructions to replicate a Proc Surveyselect step. The instructions include the seed used, but it seems that results can still vary depending on the sort order (and the instructions don't say anything about sorting). I've been reviewing the online documentation, and all of the information I'm finding about sorting and Proc Surveyselect applies when strata are specified. Does sort order affect the results of Proc Surveyselect when a seed is specified (and no strata statement is included)?
proc surveyselect data = dataset groups=10 seed=123456 out=dataset_out; run;
Let's test it out. The answer is yes, but what better way to learn than to test it yourself?
data class;
set sashelp.class;
run;
/* Create 10 random groups from class */
proc surveyselect
data = class
groups = 10
seed = 123456
out = class_sample;
run;
/* Resort the data */
proc sort data=class;
by age;
run;
/* Create 10 random groups from the sorted data */
proc surveyselect
data = class
groups = 10
seed = 123456
out = class_sample_sorted;
run;
/* Resort both datasets back by name */
proc sort data=class_sample;
by name;
run;
proc sort data=class_sample_sorted;
by name;
run;
/* Compare the difference between assigned groups */
proc compare base = class_sample
compare = class_sample_sorted
;
id name;
var groupid;
run;
The results from proc compare show that dataset order does indeed matter and will give different sampling results.
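To replicate a run regardless of how the input happens to be stored, one option is to put the data into a known order before sampling. The sketch below assumes name is a unique key (it is in sashelp.class); class_fixed and class_sample_fixed are illustrative dataset names, not part of the original instructions.

```sas
/* Fix the row order on a unique key so the seed always sees the same input */
proc sort data=class out=class_fixed;
    by name;
run;

/* Same seed + same row order should reproduce the same group assignment */
proc surveyselect
    data   = class_fixed
    groups = 10
    seed   = 123456
    out    = class_sample_fixed;
run;
```

This only helps if whoever produced the original results used the same order, so it is worth asking which sort (if any) was applied before the original step.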
I am a beginner with SAS and am trying to create a table with the code below. The code has been running for 3 hours now. The dataset is fairly large (150000 rows). However, when I insert a different date, it runs in 45 minutes. The date I have inserted is a valid value of date_key. Any suggestions on why this may be, or what I can do? Thanks in advance.
proc sql;
create table xyz as
select monotonic() as rownum ,*
from x.facility_yz
where (Fac_Name = 'xyz' and (Ratingx = 'xyz' or Ratingx is null) )
and Date_key = '20000101'
;
quit;
Tried running it again but same problem
Is your dataset coming from an external database? A SAS dataset of this size should not take nearly this long to query; it should be almost instant. If it is external, you may be able to take advantage of indexing. Find out what the database table is indexed on and filter on that as a first pass. You may also consider using a data step instead of SQL with the monotonic() function.
For example, assume it is indexed by date:
data xyz1;
set x.facility_xyz;
where date_key = '20000101';
run;
Then you can filter this final dataset within SAS itself. 150,000 rows is nothing for a SAS dataset, assuming there aren't hundreds of variables making it large. A SAS dataset this size should run lightning fast when querying.
data xyz2;
set xyz1;
where fac_name = 'xyz' AND (Ratingx = 'xyz' or Ratingx = ' ');
rownum = _N_;
run;
Or, you could try it all in one pass while still taking advantage of the index:
data xyz;
set x.facility_xyz;
where date_key = '20000101';
if(fac_name = 'xyz' AND (Ratingx = 'xyz' or Ratingx = ' ') );
rownum+1;
run;
You could also try rearranging your where statement to see if you can take advantage of compound indexing:
data xyz;
set x.facility_xyz;
where date_key = '20000101'
AND fac_name = 'xyz'
AND (Ratingx = 'xyz' or Ratingx = ' ')
;
rownum = _N_;
run;
More importantly, only keep variables that are necessary. If you need all of them then that is okay, but consider using the keep= or drop= dataset options to only pull what you need. This is especially important when talking with an external database.
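As a minimal sketch of that advice, the keep= dataset option can be placed directly on the input table. The list of variables here is only an assumption about what is needed downstream; any variable used in the where statement has to be in the kept list.

```sas
data xyz;
    /* Pull only the columns that are needed (assumed list) */
    set x.facility_xyz(keep=fac_name ratingx date_key);
    where date_key = '20000101'
      AND fac_name = 'xyz'
      AND (Ratingx = 'xyz' or Ratingx = ' ');
    rownum + 1;
run;
```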
What kind of libname do you use?
If you are running implicit pass-through with a SAS function in the query, that would explain why it takes so long: a function the database cannot translate (such as monotonic()) may force SAS to pull the rows across and process them locally.
If you are using a SAS/ACCESS interface to xxx, first add an option to understand what is going on: options sastrace=',,,d' sastraceloc=saslog;
You should probably use explicit pass-through, that is, write the query in the RDBMS's native language to avoid the automatic translation of your code.
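A rough sketch of explicit pass-through, assuming the libref x from the question points at the database through a SAS/ACCESS engine and that the table is called facility_yz on the database side (both are assumptions; the alias db is arbitrary). The inner query is sent to the database as written, so it must use that database's own SQL dialect.

```sas
proc sql;
    /* Reuse the connection details of the existing libref x */
    connect using x as db;

    create table xyz_raw as
    select * from connection to db
    (
        select *
        from facility_yz
        where Fac_Name = 'xyz'
          and (Ratingx = 'xyz' or Ratingx is null)
          and Date_key = '20000101'
    );

    disconnect from db;
quit;

/* Add the row number on the SAS side, once the data is local */
data xyz;
    set xyz_raw;
    rownum = _N_;
run;
```

Numbering the rows in a separate data step keeps monotonic() out of the query, so nothing in it needs to be translated by SAS.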
I am new to SAS and would like to form portfolios from the intersection of 2 variables in my spreadsheet.
Basically, I have an Excel file called 'Up' with variables in it such as 'month', 'company', 'BM' and 'market cap usd'.
For each month, I would like to sort my data by size (descending) and then by BM (descending). I would like to create 4 size portfolios according to P25, P50 and P75, with the first size portfolio being above P75 (for each month), and so on. Then, within each size portfolio, I would like to create 4 new portfolios based on 'BM', again using P25, P50 and P75.
Could someone help me by showing the SAS code and how to apply it to my existing 'Up' file (the sheet is also named 'up')?
So I agree with the comment, this is not asked well. However, it is a common problem to solve and somewhat fun. So here goes:
First I'm going to just make up some data. Google search how to read Excel in SAS. It's easy.
1000 companies with a random SIZE and BM value.
data companies(drop=c);
format company $12.;
do c=1 to 1000;
company = catt("C_",put(c,z4.));
size = ceil(100*ranuni(1));
BM = ceil(100*ranuni(1));
output;
end;
run;
So I'm assuming you just want equal amounts in these 4 groups. You don't want to estimate percentiles based on a distribution or KDE. For this, PROC RANK works well.
proc rank data=companies out=companies descending groups=4;
var size;
ranks p_size;
run;
We now have a variable P_SIZE that is values 0,1,2,3 based on the descending order of SIZE.
Sort the portfolios by that P_SIZE value.
proc sort data=companies;
by p_size;
run;
Now run PROC RANK again, this time using a BY statement with P_SIZE, ranking on BM, and creating P_SIZE_BM.
proc rank data=companies out=companies descending groups=4;
var bm;
by p_size;
ranks p_size_bm;
run;
P_SIZE_BM now contains values 0,1,2,3 for EACH value of P_SIZE.
Sort the data and see how it comes out:
proc sort data=companies;
by p_size p_size_bm;
run;
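The question asks for the split to be done within each month, which the made-up data above does not have. A possible adaptation is sketched below. The file path is a placeholder, the variable name market_cap_usd is an assumption about how the 'market cap usd' column is named after import, and dbms=xlsx assumes SAS/ACCESS Interface to PC Files is available.

```sas
/* Read the sheet 'up' from the workbook (placeholder path) */
proc import datafile="/path/to/Up.xlsx" out=up dbms=xlsx replace;
    sheet="up";
run;

proc sort data=up;
    by month;
run;

/* Size quartiles within each month: 0 = largest, 3 = smallest */
proc rank data=up out=up_ranked descending groups=4;
    by month;
    var market_cap_usd;   /* assumed variable name */
    ranks p_size;
run;

proc sort data=up_ranked;
    by month p_size;
run;

/* BM quartiles within each month and size portfolio */
proc rank data=up_ranked out=up_ranked descending groups=4;
    by month p_size;
    var bm;
    ranks p_size_bm;
run;
```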
I am trying to select the top 10 exposures for each class of business out of a large data set.
Below is an example of the dataset.
[image: dataset example]
If I only needed the overall top 10 exposures, I would simply sort by exposure descending (as I have done) and use the (obs = 10) dataset option.
However I require the top 10 for each LOB.
Do you know how I could do this in SAS?
Thanks!
I would create a counting dummy variable, counting the number of exposures per lines of business and then delete any observation for which the dummy variable exceeds 10.
This can be done in a single datastep (given that the data is properly sorted) by (ab-)using that SAS code runs top to bottom.
proc sort data = have out=temp; by lob descending exposure; run;
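data want(drop=countlob);
set temp;
by lob;
/* Restart the counter at the first row of each LOB */
if first.lob then countlob = 0;
/* Sum statement: countlob is retained across rows automatically */
countlob + 1;
/* Rows are sorted by descending exposure, so keep only the first 10 per LOB */
if countlob > 10 then delete;
run;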
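An alternative sketch uses PROC RANK with a BY statement instead of a counter. The variable name exposure_rank and the dataset names are illustrative. Note that it treats ties differently from the counter approach: with ties=low, observations tied at the 10th place are all kept, so a LOB can end up with more than 10 rows.

```sas
proc sort data=have out=temp;
    by lob;
run;

/* Rank exposures within each LOB, 1 = largest */
proc rank data=temp out=ranked descending ties=low;
    by lob;
    var exposure;
    ranks exposure_rank;
run;

data want;
    set ranked;
    /* Lower bound excludes missing ranks, which would otherwise pass "<= 10" */
    where 1 <= exposure_rank <= 10;
run;
```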
I have two data sets :
"mothers" - 5,512 observations where the variable "MOTHER" = 1
"all_women" - 2,336,750 observations where the variable "MOTHER" = 0
I combine the two as follows:
data combined;
set mothers all_women;
run;
Now, as the mothers are already in the dataset all_women, I want to delete the repeated entries, with the condition that I keep the observations where "MOTHER"=1.
I tried the following:
proc sort data=combined; by ID DESCENDING MOTHER; run;
proc sort data=combined nodupkeys; by ID; run;
yet I lose some of the mothers because I am left with only 5458 observations where "MOTHER"=1. What have I done to introduce this error?
Instead of using NODUPKEY, use FIRST./LAST. processing.
proc sort data=combined;
by ID DESCENDING MOTHER;
run;
data want;
set combined;
by ID descending mother;
if not (first.ID) and (mother=0) then delete;
run;
For an ID that only has mother=0 rows, this keeps one record; for any other ID, it keeps all of the mother=1 rows and drops the mother=0 rows.
Have you checked whether there were any duplicate IDs in the mothers dataset? The second proc sort would have eliminated those rows.
You can check like so:
proc sort data = mothers nodupkey out = mothers_dedup dupout = mothers_dups;
by ID;
run;
If mothers_dups contains more than 0 observations, this might account for the problem.
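A quicker way to run the same check, without creating extra datasets, is to compare the row count with the number of distinct IDs. This is only a sketch; the column aliases are arbitrary.

```sas
proc sql;
    /* If n_ids is smaller than n_rows, some IDs appear more than once */
    select count(*)           as n_rows,
           count(distinct ID) as n_ids
    from mothers;
quit;
```

If n_ids comes out at 5458 while n_rows is 5512, duplicate IDs in mothers would account for the missing observations.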
Say I have a dataset for a supermarket, with a product category, price, product name, etc. I want to sort by the category, but in a defined order as opposed to alphabetically.
For instance, say the categories are canned, dairy, meat and vegetable, and I want to sort by when they may expire (I know it's likely we'd actually have that information, so just play along please). This means I want to sort in this order: Dairy, Meat, Vegetable, Canned.
I wrote a macro with this signature:
key_sort(ds=, keys='canned, dairy, meat, vegetable', field =category, sort_by=)
This parses the keys so that they can be put in a macro loop, and then I use that macro loop to write out a select statement like so:
Select(&field. ) ;
%do i=1 %to &number_of_keys. ;
%let current_key= %scan(&keys., &I., &delim.) ;
When(&current_key. ) &field._key=&i. ;
%end;
End;
I then sort by the &field._key
Is this the best method to take? Can this be done more succinctly or efficiently?
If you had a separate dataset that contained the ordering, you could use it as the input to PROC FORMAT (through the CNTLIN= option). With only a few values, the simplest approach is to define the format directly:
proc format;
value $EXPIRES
'canned'= 4
'dairy' = 1
'meat' = 2
'vegetable' = 3
other = 5
;
run;
proc sql;
CREATE TABLE output_set AS
SELECT * FROM foods ORDER BY put(produce_type, $expires.);
quit;
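For the case where the ordering lives in its own dataset, a sketch of the CNTLIN= route could look like the following. The dataset name category_order is made up for illustration; fmtname, type, start and label are the variable names PROC FORMAT expects in a control dataset.

```sas
/* Control dataset: one row per category with its sort position */
data category_order;
    length start $20 label $8;
    retain fmtname '$EXPIRES' type 'C';
    input start $ label $;
    datalines;
dairy 1
meat 2
vegetable 3
canned 4
;
run;

/* Build the $EXPIRES format from the control dataset */
proc format cntlin=category_order;
run;
```

Unlike the hand-written format above, this version has no OTHER category, so unlisted values are returned unformatted and sort by their own text.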